Credit EDA
13 Jul 2024
Problem Statement
This case study aims to identify patterns that indicate whether a client will have difficulty paying their instalments. These patterns can inform actions such as denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate, while ensuring that consumers capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study.
In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.
Business Understanding
Loan providers find it hard to lend to people with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting. The company in this case study is a consumer finance company that specialises in lending various types of loans to urban customers. The task is to use EDA to analyse the patterns present in the data, so that applicants capable of repaying the loan are not rejected.
When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with the company’s decision:
If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company
If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.
The data given below contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:
The client with payment difficulties: he/she had late payment more than X days on at least one of the first Y instalments of the loan in our sample,
All other cases: All other cases when the payment is paid on time.
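The definition of a client with payment difficulties can be sketched as a small labelling rule. This is only an illustration: the function name `has_payment_difficulty`, its parameters, and the sample history are hypothetical, and the thresholds passed in are arbitrary placeholders because the source does not state the actual values of X and Y.

```python
def has_payment_difficulty(days_late, x_days, y_instalments):
    """Return True if any of the first y_instalments was paid more than x_days late.

    days_late: days past due for each instalment, in order (0 = on time).
    x_days, y_instalments: stand-ins for the unspecified X and Y.
    """
    return any(d > x_days for d in days_late[:y_instalments])


# Hypothetical payment history and placeholder thresholds, for illustration only
history = [0, 3, 45, 0, 0]
flag_first_three = has_payment_difficulty(history, x_days=30, y_instalments=3)
flag_first_two = has_payment_difficulty(history, x_days=30, y_instalments=2)
print(flag_first_three, flag_first_two)
```

With these placeholder values, the third instalment (45 days late) triggers the flag only when it falls within the first Y instalments.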
When a client applies for a loan, there are four types of decisions that could be taken by the client/company:
Approved: The company has approved the loan application.
Cancelled: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan or in some cases due to a higher risk of the client, he received worse pricing which he did not want.
Refused: The company had rejected the loan (because the client does not meet their requirements etc.).
Unused offer: Loan has been cancelled by the client but at different stages of the process.
In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency to default.
Steps to perform EDA:
Data Cleaning
- Column Renaming
- Missing Value Imputation
- Outlier Analysis
- Incorrect values replacement
Data Imbalance
Data Visualization
- Univariate Analysis
- Bivariate Analysis
- Multivariate Analysis
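The data imbalance step can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's own code: the helper name `imbalance_summary` and the toy data are assumptions, while `TARGET` is the label column of the application data.

```python
import pandas as pd


def imbalance_summary(df, target="TARGET"):
    """Return class percentages and the majority-to-minority ratio."""
    pct = df[target].value_counts(normalize=True) * 100
    ratio = pct.max() / pct.min()
    return pct, ratio


# Toy data (assumption): 9 rows with TARGET=0 and 1 row with TARGET=1
toy = pd.DataFrame({"TARGET": [0] * 9 + [1]})
pct, ratio = imbalance_summary(toy)
print(pct)
print("Imbalance ratio: {0:.1f} : 1".format(ratio))
```

A large ratio signals that per-class proportions, rather than raw counts, should be compared in the later plots.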
Credit EDA
- This file comprises all the analysis of the given dataset.
- Note: Some code cells, specifically those that generate graphs/charts, may take a long time to run.
- Coded using VS Code, Ubuntu 22 64-bit, 16 GB RAM, 4.5 GHz
# import all libraries numpy, pandas, matplotlib, seaborn.
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme(color_codes=True)
import warnings
warnings.filterwarnings('ignore')
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
# %matplotlib inline
# Set custom display properties in pandas
pd.set_option("display.max_rows", 900)
pd.set_option("display.max_columns", 900)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# read 'application_data.csv' and store the dataframe as 'curr_appl_data'
curr_appl_data = pd.read_csv("application_data.csv", sep=",", header=0)
# read 'previous_application.csv' and store the dataframe as 'prev_appl_data'
prev_appl_data = pd.read_csv("previous_application.csv", sep=",", header=0)
Custom Functions
# classify_feature_dtype splits numerical columns into categorical/discrete ('int_cat')
# and non-categorical ('float_ts') groups and returns them as a dict
def classify_feature_dtype(df, cols):
    d_categories = {'int_cat': [], 'float_ts': []}
    for col in cols:
        # Fewer than 10 distinct values (NaN counts as one) is treated as categorical
        if len(df[col].unique()) < 10:
            d_categories['int_cat'].append(col)
        else:
            d_categories['float_ts'].append(col)
    return d_categories
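A short usage sketch on a toy frame shows what the returned dict looks like. The function is restated compactly so the block runs on its own; the `max_levels` parameter and the toy columns are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def classify_feature_dtype(df, cols, max_levels=10):
    # Compact restatement of the rule above; max_levels is a hypothetical parameter
    d_categories = {"int_cat": [], "float_ts": []}
    for col in cols:
        # nunique(dropna=False) counts NaN as a value, like len(df[col].unique())
        key = "int_cat" if df[col].nunique(dropna=False) < max_levels else "float_ts"
        d_categories[key].append(col)
    return d_categories


toy = pd.DataFrame({
    "FLAG": [0, 1] * 10,            # 2 distinct values
    "AMOUNT": np.arange(20) * 1.5,  # 20 distinct values
})
groups = classify_feature_dtype(toy, toy.columns)
print(groups)
```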
# show_stats prints summary statistics for each of the given columns
def show_stats(df, cols):
    for col in list(cols):
        print("Total Nulls: {0},\nMode: {1}".format(df[col].isna().sum(), df[col].mode()[0]))
        # List the distinct values only for low-cardinality columns
        if len(df[col].unique()) < 50:
            print("\nUnique: {0}\n".format(df[col].unique()))
        # is_numeric_dtype avoids the platform-dependent comparison against int/float
        if pd.api.types.is_numeric_dtype(df[col]):
            print("Median : {0}, \nVariance: {1}, \n\nDescribe: {2} \n".format(df[col].median(), df[col].var(), df[col].describe()))
        print("ValueCounts: {0} \n\n\n".format((df[col].value_counts(normalize=True) * 100).head(5)))
        print("------------------------------------------------------------------")
# check_cols_null_pct returns the percentage of null values in each column of a dataframe
def check_cols_null_pct(df):
    df_non_na = df.count() / len(df)  # Ratio of non-null values
    df_na_pct = (1 - df_non_na) * 100  # Percentage of null values
    return df_na_pct.sort_values(ascending=False)  # Sort the resulting values in descending order
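The same quantity can be illustrated on a toy frame. This sketch uses `isna().mean()`, which is an equivalent formulation rather than the notebook's own code; the toy columns are assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 75% null
    "b": [1.0, 2.0, np.nan, 4.0],        # 25% null
    "c": [1, 2, 3, 4],                   # 0% null
})

# Same quantity as (1 - df.count() / len(df)) * 100, sorted in descending order
na_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(na_pct)
```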
# univariate_plots generates charts based on the data type of the columns, as part of the univariate analysis.
# It takes the dataframe, the columns, the target 0 and target 1 subsets (t0, t1), and the feature type as args.
def univariate_plots(df, cols, t0=None, t1=None, ftype=None):
    for col in cols:
        # Categorical features: pie chart and countplot, plus a per-target bar chart when t0 and t1 are provided
        if ftype == "category":
            if (t0 is not None) and (t1 is not None):
                fig, axs = plt.subplots(1, 3, figsize=(20, 6))
            else:
                fig, axs = plt.subplots(1, 2, figsize=(20, 6))
            col_idx = 0
            axs[col_idx].pie(x=df[col].value_counts().head(5), labels=df[col].value_counts().head(5).index, autopct="%1.1f%%",
                             radius=1, textprops={"fontsize": 10, "color": "Black"}, startangle=90, rotatelabels=False)
            axs[col_idx].set_title("Pie chart of {0}".format(col), y=1); plt.xticks(rotation=45); plt.ylabel("Percentage")
            fig.subplots_adjust(wspace=0.5, hspace=0.3)
            col_idx += 1
            sns.countplot(data=df, y=col, order=df[col].value_counts().index, palette="viridis", ax=axs[col_idx])
            axs[col_idx].set_title("Countplot of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col); plt.ylabel("Count")
            fig.subplots_adjust(wspace=0.5, hspace=0.3)
            col_idx += 1
            if (t0 is not None) and (t1 is not None):
                # Share of each category within the target 0 and target 1 subsets
                dff = pd.DataFrame()
                dff['target_0_nodffty'] = t0[col].value_counts() / len(t0)
                dff['target_1_paydfty'] = t1[col].value_counts() / len(t1)
                dff.plot.bar(ax=axs[col_idx])
                axs[col_idx].set_title('Plotting data for target in terms of percentage'); plt.xticks(rotation=45); plt.xlabel(col)
                fig.subplots_adjust(wspace=0.5, hspace=0.3)
            # else:
            #     sns.boxplot(data=df, y=col, palette="viridis", flierprops=dict( marker="o", markersize=6, markerfacecolor="red", markeredgecolor="black", ),
            #                 medianprops=dict(linestyle="-", linewidth=3, color="#FF9900"), whiskerprops=dict(linestyle="-", linewidth=2, color="black"),
            #                 capprops=dict(linestyle="-", linewidth=2, color="black"), ax=axs[col_idx])
            #     axs[col_idx].set_title("Boxplot of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col)
            #     fig.subplots_adjust(wspace=0.5, hspace=0.3)
            plt.suptitle("Univariate analysis of {0}".format(col), fontsize=12, y=0.95)
            plt.tight_layout()
            plt.subplots_adjust(top=0.85)
            plt.show()
            plt.clf()
        # Numerical features: boxplot, histogram, KDE plot, scatterplot
        elif ftype == "non_categorical":
            fig, axs = plt.subplots(1, 4, figsize=(20, 6))
            col_idx = 0
            sns.boxplot(data=df, y=col, palette="viridis",
                        flierprops=dict(marker="o", markersize=6, markerfacecolor="red", markeredgecolor="black"),
                        medianprops=dict(linestyle="-", linewidth=3, color="#FF9900"), whiskerprops=dict(linestyle="-", linewidth=2, color="black"),
                        capprops=dict(linestyle="-", linewidth=2, color="black"), ax=axs[col_idx])
            axs[col_idx].set_title("Boxplot of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col)
            fig.subplots_adjust(wspace=0.5, hspace=0.3)
            col_idx += 1
            axs[col_idx].hist(data=df, x=col, label=col)
            axs[col_idx].set_title("Histogram of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col)
            fig.subplots_adjust(wspace=0.5, hspace=0.3)
            col_idx += 1
            # fill=True replaces the deprecated shade=True
            sns.kdeplot(df[col], fill=True, ax=axs[col_idx])
            axs[col_idx].set_title("KDE plot of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col)
            fig.subplots_adjust(wspace=0.5, hspace=0.3)
            col_idx += 1
            sns.scatterplot(df[col], ax=axs[col_idx])
            axs[col_idx].set_title("Scatterplot of {0}".format(col)); plt.xticks(rotation=45); plt.xlabel(col)
            fig.subplots_adjust(wspace=0.5, hspace=0.3)
            plt.suptitle("Univariate analysis of {0}".format(col), fontsize=12, y=0.95)
            plt.tight_layout()
            plt.subplots_adjust(top=0.85)
            plt.show()
            plt.clf()
# bivariate_plots generates charts for a single column as part of the bivariate analysis.
# It takes the dataframe, the column, the hue column, and the target 0 and target 1 subsets (train0, train1) as args.
def bivariate_plots(df, col, hue, train0, train1):
    fig = plt.figure(figsize=(20,10))
    ax1 = plt.subplot(221)
    df[col].value_counts().plot.pie(autopct = "%1.1f%%", radius=1.4, textprops={"fontsize": 10, "color": "Black"}, startangle=30, rotatelabels=True, ax=ax1)
    plt.figtext(0.25, 0.55, 'Pie Chart for : '+ col, ha='center')
    # plt.title('Plotting data for the column: '+ col, loc='right', fontdict={'verticalalignment': 'baseline', 'horizontalalignment': 'right'})
    # fig.subplots_adjust(wspace=0.5, hspace=0.3)
    ax2 = plt.subplot(222)
    # Share of each category within each target subset (separate name so the df argument is not shadowed)
    target_pct = pd.DataFrame()
    target_pct['0'] = train0[col].value_counts() / len(train0)
    target_pct['1'] = train1[col].value_counts() / len(train1)
    target_pct.plot.bar(ax=ax2)
    plt.title('Plotting data for target in terms of percentage')
    plt.xticks(rotation=45)
    fig.subplots_adjust(wspace=0.5, hspace=0.3)
    ax3 = plt.subplot(223)
    sns.countplot(x=col, hue=hue, data=train0, ax = ax3)
    plt.xticks(rotation=45)
    plt.title('Plotting data for Target=0 in terms of total count')
    fig.subplots_adjust(wspace=0.5, hspace=0.3)
    ax4 = plt.subplot(224)
    sns.countplot(x=col, hue=hue, data=train1, ax = ax4)
    plt.xticks(rotation=45)
    plt.title('Plotting data for Target=1 in terms of total count')
    fig.subplots_adjust(wspace=0.5, hspace=0.3)
    plt.tight_layout()
    plt.show()
    plt.clf()
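Both plotting helpers rely on per-target subsets and on the share of each category within each subset. The sketch below shows that computation on toy data without any plotting. How the notebook actually builds these subsets is not shown in this section, so filtering on `TARGET` is an assumption, and the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 0, 1, 1],
    "CODE_GENDER": ["F", "F", "F", "M", "M", "M"],
})

# Assumed split: one subset per TARGET value
train0 = toy[toy["TARGET"] == 0]
train1 = toy[toy["TARGET"] == 1]

# Share of each category within each target subset;
# categories absent from a subset become NaN and are filled with 0
target_pct = pd.DataFrame({
    "0": train0["CODE_GENDER"].value_counts() / len(train0),
    "1": train1["CODE_GENDER"].value_counts() / len(train1),
}).fillna(0)
print(target_pct)
```

Dividing by the size of each subset keeps the two bars comparable even when one target class is much rarer than the other.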
Data Cleaning - Current Application Dataset
check_cols_null_pct(curr_appl_data)
COMMONAREA_MEDI                69.872
COMMONAREA_AVG                 69.872
COMMONAREA_MODE                69.872
NONLIVINGAPARTMENTS_MODE       69.433
NONLIVINGAPARTMENTS_AVG        69.433
NONLIVINGAPARTMENTS_MEDI       69.433
FONDKAPREMONT_MODE             68.386
LIVINGAPARTMENTS_MODE          68.355
LIVINGAPARTMENTS_AVG           68.355
LIVINGAPARTMENTS_MEDI          68.355
FLOORSMIN_AVG                  67.849
FLOORSMIN_MODE                 67.849
FLOORSMIN_MEDI                 67.849
YEARS_BUILD_MEDI               66.498
YEARS_BUILD_MODE               66.498
YEARS_BUILD_AVG                66.498
OWN_CAR_AGE                    65.991
LANDAREA_MEDI                  59.377
LANDAREA_MODE                  59.377
LANDAREA_AVG                   59.377
BASEMENTAREA_MEDI              58.516
BASEMENTAREA_AVG               58.516
BASEMENTAREA_MODE              58.516
EXT_SOURCE_1                   56.381
NONLIVINGAREA_MODE             55.179
NONLIVINGAREA_AVG              55.179
NONLIVINGAREA_MEDI             55.179
ELEVATORS_MEDI                 53.296
ELEVATORS_AVG                  53.296
ELEVATORS_MODE                 53.296
WALLSMATERIAL_MODE             50.841
APARTMENTS_MEDI                50.750
APARTMENTS_AVG                 50.750
APARTMENTS_MODE                50.750
ENTRANCES_MEDI                 50.349
ENTRANCES_AVG                  50.349
ENTRANCES_MODE                 50.349
LIVINGAREA_AVG                 50.193
LIVINGAREA_MODE                50.193
LIVINGAREA_MEDI                50.193
HOUSETYPE_MODE                 50.176
FLOORSMAX_MODE                 49.761
FLOORSMAX_MEDI                 49.761
FLOORSMAX_AVG                  49.761
YEARS_BEGINEXPLUATATION_MODE   48.781
YEARS_BEGINEXPLUATATION_MEDI   48.781
YEARS_BEGINEXPLUATATION_AVG    48.781
TOTALAREA_MODE                 48.269
EMERGENCYSTATE_MODE            47.398
OCCUPATION_TYPE                31.346
EXT_SOURCE_3                   19.825
AMT_REQ_CREDIT_BUREAU_HOUR     13.502
AMT_REQ_CREDIT_BUREAU_DAY      13.502
AMT_REQ_CREDIT_BUREAU_WEEK     13.502
AMT_REQ_CREDIT_BUREAU_MON      13.502
AMT_REQ_CREDIT_BUREAU_QRT      13.502
AMT_REQ_CREDIT_BUREAU_YEAR     13.502
NAME_TYPE_SUITE                 0.420
OBS_30_CNT_SOCIAL_CIRCLE        0.332
DEF_30_CNT_SOCIAL_CIRCLE        0.332
OBS_60_CNT_SOCIAL_CIRCLE        0.332
DEF_60_CNT_SOCIAL_CIRCLE        0.332
EXT_SOURCE_2                    0.215
AMT_GOODS_PRICE                 0.090
AMT_ANNUITY                     0.004
CNT_FAM_MEMBERS                 0.001
DAYS_LAST_PHONE_CHANGE          0.000
CNT_CHILDREN                    0.000
FLAG_DOCUMENT_8                 0.000
NAME_CONTRACT_TYPE              0.000
CODE_GENDER                     0.000
FLAG_OWN_CAR                    0.000
FLAG_DOCUMENT_2                 0.000
FLAG_DOCUMENT_3                 0.000
FLAG_DOCUMENT_4                 0.000
FLAG_DOCUMENT_5                 0.000
FLAG_DOCUMENT_6                 0.000
FLAG_DOCUMENT_7                 0.000
FLAG_DOCUMENT_9                 0.000
FLAG_DOCUMENT_21                0.000
FLAG_DOCUMENT_10                0.000
FLAG_DOCUMENT_11                0.000
FLAG_OWN_REALTY                 0.000
FLAG_DOCUMENT_13                0.000
FLAG_DOCUMENT_14                0.000
FLAG_DOCUMENT_15                0.000
FLAG_DOCUMENT_16                0.000
FLAG_DOCUMENT_17                0.000
FLAG_DOCUMENT_18                0.000
FLAG_DOCUMENT_19                0.000
FLAG_DOCUMENT_20                0.000
FLAG_DOCUMENT_12                0.000
AMT_CREDIT                      0.000
AMT_INCOME_TOTAL                0.000
FLAG_PHONE                      0.000
LIVE_CITY_NOT_WORK_CITY         0.000
REG_CITY_NOT_WORK_CITY          0.000
TARGET                          0.000
REG_CITY_NOT_LIVE_CITY          0.000
LIVE_REGION_NOT_WORK_REGION     0.000
REG_REGION_NOT_WORK_REGION      0.000
REG_REGION_NOT_LIVE_REGION      0.000
HOUR_APPR_PROCESS_START         0.000
WEEKDAY_APPR_PROCESS_START      0.000
REGION_RATING_CLIENT_W_CITY     0.000
REGION_RATING_CLIENT            0.000
FLAG_EMAIL                      0.000
FLAG_CONT_MOBILE                0.000
ORGANIZATION_TYPE               0.000
FLAG_WORK_PHONE                 0.000
FLAG_EMP_PHONE                  0.000
FLAG_MOBIL                      0.000
DAYS_ID_PUBLISH                 0.000
DAYS_REGISTRATION               0.000
DAYS_EMPLOYED                   0.000
DAYS_BIRTH                      0.000
REGION_POPULATION_RELATIVE      0.000
NAME_HOUSING_TYPE               0.000
NAME_FAMILY_STATUS              0.000
NAME_EDUCATION_TYPE             0.000
NAME_INCOME_TYPE                0.000
SK_ID_CURR                      0.000
dtype: float64
- The output above lists 41 columns whose null percentage is greater than or equal to 50%.
# Compute the percentage of null values in each column, then keep only the columns below 50%
curr_appl_na_pct = check_cols_null_pct(curr_appl_data)
curr_appl_data1 = curr_appl_data.loc[:, (curr_appl_na_pct < 50)]
curr_appl_data1.columns
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'YEARS_BEGINEXPLUATATION_AVG', 'FLOORSMAX_AVG', 'YEARS_BEGINEXPLUATATION_MODE', 'FLOORSMAX_MODE', 'YEARS_BEGINEXPLUATATION_MEDI', 'FLOORSMAX_MEDI', 'TOTALAREA_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'], dtype='object')
- Imputing a column in which 50% or more of the values are missing is generally not recommended, because the imputed values would dominate the column.
- Therefore we drop every column whose null percentage is 50% or higher.
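The drop rule can be checked on a toy frame, including the boundary case of exactly 50%. This is a minimal sketch with made-up columns; it uses `isna().mean()` to compute the null percentage and the same `< 50` mask as the notebook to keep columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],  # 75% null, dropped
    "half_null": [np.nan, np.nan, 1.0, 2.0],       # exactly 50% null, dropped (not < 50)
    "complete": [1, 2, 3, 4],                      # 0% null, kept
})

na_pct = toy.isna().mean() * 100
kept = toy.loc[:, na_pct < 50]                 # boolean mask aligned on column names
dropped = na_pct[na_pct >= 50].index.tolist()  # columns removed by the rule
print("Dropped:", dropped)
print("Kept:", kept.columns.tolist())
```

Because `.loc` aligns the boolean Series on column labels, the mask works even if the percentages have been sorted into a different order than the columns.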
#helper function to print all statistical information for given columns in df
show_stats(curr_appl_data1, curr_appl_data1.columns)
Total Nulls: 0, Mode: 100002 Median : 278202.0, Variance: 10565820148.159698, Describe: count 307511.000 mean 278180.519 std 102790.175 min 100002.000 25% 189145.500 50% 278202.000 75% 367142.500 max 456255.000 Name: SK_ID_CURR, dtype: float64 ValueCounts: SK_ID_CURR 100002 0.000 337664 0.000 337661 0.000 337660 0.000 337659 0.000 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [1 0] Median : 0.0, Variance: 0.07421191849651897, Describe: count 307511.000 mean 0.081 std 0.272 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: TARGET, dtype: float64 ValueCounts: TARGET 0 91.927 1 8.073 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash loans Unique: ['Cash loans' 'Revolving loans'] ValueCounts: NAME_CONTRACT_TYPE Cash loans 90.479 Revolving loans 9.521 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: F Unique: ['M' 'F' 'XNA'] ValueCounts: CODE_GENDER F 65.834 M 34.164 XNA 0.001 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: N Unique: ['N' 'Y'] ValueCounts: FLAG_OWN_CAR N 65.989 Y 34.011 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Y Unique: ['Y' 'N'] ValueCounts: FLAG_OWN_REALTY Y 69.367 N 30.633 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [ 0 1 2 3 4 7 5 6 8 9 11 12 10 19 14] Median : 0.0, Variance: 0.5214592938640343, Describe: count 307511.000 mean 0.417 std 0.722 min 0.000 25% 0.000 50% 0.000 75% 1.000 max 19.000 Name: CNT_CHILDREN, dtype: float64 ValueCounts: CNT_CHILDREN 0 70.037 1 19.875 2 8.699 3 1.209 4 0.140 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 135000.0 Median : 147150.0, Variance: 56227386501.174484, Describe: count 307511.000 mean 168797.919 std 237123.146 min 25650.000 25% 112500.000 50% 147150.000 75% 202500.000 max 117000000.000 Name: AMT_INCOME_TOTAL, dtype: float64 ValueCounts: AMT_INCOME_TOTAL 135000.000 11.626 112500.000 10.087 157500.000 8.636 180000.000 8.038 90000.000 7.311 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 450000.0 Median : 513531.0, Variance: 161998825566.80032, Describe: count 307511.000 mean 599026.000 std 402490.777 min 45000.000 25% 270000.000 50% 513531.000 75% 808650.000 max 4050000.000 Name: AMT_CREDIT, dtype: float64 ValueCounts: AMT_CREDIT 450000.000 3.157 675000.000 2.887 225000.000 2.654 180000.000 2.388 270000.000 2.355 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 12, Mode: 9000.0 Median : 24903.0, Variance: 210068421.35962632, Describe: count 307499.000 mean 27108.574 std 14493.737 min 1615.500 25% 16524.000 50% 24903.000 75% 34596.000 max 258025.500 Name: AMT_ANNUITY, dtype: float64 ValueCounts: AMT_ANNUITY 9000.000 2.076 13500.000 1.793 6750.000 0.741 10125.000 0.662 37800.000 0.521 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 278, Mode: 450000.0 Median : 450000.0, Variance: 136490687205.5433, Describe: count 307233.000 mean 538396.207 std 369446.461 min 40500.000 25% 238500.000 50% 450000.000 75% 679500.000 max 4050000.000 Name: AMT_GOODS_PRICE, dtype: float64 ValueCounts: AMT_GOODS_PRICE 450000.000 8.470 225000.000 8.229 675000.000 8.125 900000.000 5.018 270000.000 3.720 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 1292, Mode: Unaccompanied Unique: ['Unaccompanied' 'Family' 'Spouse, partner' 
'Children' 'Other_A' nan 'Other_B' 'Group of people'] ValueCounts: NAME_TYPE_SUITE Unaccompanied 81.160 Family 13.111 Spouse, partner 3.713 Children 1.067 Other_B 0.578 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Working Unique: ['Working' 'State servant' 'Commercial associate' 'Pensioner' 'Unemployed' 'Student' 'Businessman' 'Maternity leave'] ValueCounts: NAME_INCOME_TYPE Working 51.632 Commercial associate 23.289 Pensioner 18.003 State servant 7.058 Unemployed 0.007 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Secondary / secondary special Unique: ['Secondary / secondary special' 'Higher education' 'Incomplete higher' 'Lower secondary' 'Academic degree'] ValueCounts: NAME_EDUCATION_TYPE Secondary / secondary special 71.019 Higher education 24.345 Incomplete higher 3.342 Lower secondary 1.241 Academic degree 0.053 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Married Unique: ['Single / not married' 'Married' 'Civil marriage' 'Widow' 'Separated' 'Unknown'] ValueCounts: NAME_FAMILY_STATUS Married 63.878 Single / not married 14.778 Civil marriage 9.683 Separated 6.429 Widow 5.232 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: House / apartment Unique: ['House / apartment' 'Rented apartment' 'With parents' 'Municipal apartment' 'Office apartment' 'Co-op apartment'] ValueCounts: NAME_HOUSING_TYPE House / apartment 88.734 With parents 4.826 Municipal apartment 3.637 Rented apartment 1.587 Office apartment 0.851 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.035792 Median : 0.01885, Variance: 0.00019130430983259252, Describe: count 307511.000 mean 0.021 std 0.014 min 0.000 25% 0.010 50% 0.019 
75% 0.029 max 0.073 Name: REGION_POPULATION_RELATIVE, dtype: float64 ValueCounts: REGION_POPULATION_RELATIVE 0.036 5.336 0.046 4.371 0.031 3.955 0.025 3.886 0.026 3.773 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: -13749 Median : -15750.0, Variance: 19044396.778353307, Describe: count 307511.000 mean -16036.995 std 4363.989 min -25229.000 25% -19682.000 50% -15750.000 75% -12413.000 max -7489.000 Name: DAYS_BIRTH, dtype: float64 ValueCounts: DAYS_BIRTH -13749 0.014 -13481 0.014 -18248 0.013 -10020 0.013 -15771 0.013 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 365243 Median : -1213.0, Variance: 19958842205.435524, Describe: count 307511.000 mean 63815.046 std 141275.767 min -17912.000 25% -2760.000 50% -1213.000 75% -289.000 max 365243.000 Name: DAYS_EMPLOYED, dtype: float64 ValueCounts: DAYS_EMPLOYED 365243 18.007 -200 0.051 -224 0.049 -230 0.049 -199 0.049 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: -1.0 Median : -4504.0, Variance: 12410728.030428572, Describe: count 307511.000 mean -4986.120 std 3522.886 min -24672.000 25% -7479.500 50% -4504.000 75% -2010.000 max 0.000 Name: DAYS_REGISTRATION, dtype: float64 ValueCounts: DAYS_REGISTRATION -1.000 0.037 -7.000 0.032 -6.000 0.031 -4.000 0.030 -2.000 0.030 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: -4053 Median : -3254.0, Variance: 2278440.5674284254, Describe: count 307511.000 mean -2994.202 std 1509.450 min -7197.000 25% -4299.000 50% -3254.000 75% -1720.000 max 0.000 Name: DAYS_ID_PUBLISH, dtype: float64 ValueCounts: DAYS_ID_PUBLISH -4053 0.055 -4095 0.053 -4046 0.052 -4417 0.052 -4256 0.051 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, 
Mode: 1 Unique: [1 0] Median : 1.0, Variance: 3.251916191640975e-06, Describe: count 307511.000 mean 1.000 std 0.002 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_MOBIL, dtype: float64 ValueCounts: FLAG_MOBIL 1 100.000 0 0.000 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.147671271296676, Describe: count 307511.000 mean 0.820 std 0.384 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_EMP_PHONE, dtype: float64 ValueCounts: FLAG_EMP_PHONE 1 81.989 0 18.011 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.15962120697935117, Describe: count 307511.000 mean 0.199 std 0.400 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_WORK_PHONE, dtype: float64 ValueCounts: FLAG_WORK_PHONE 0 80.063 1 19.937 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.0018631217575536686, Describe: count 307511.000 mean 0.998 std 0.043 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_CONT_MOBILE, dtype: float64 ValueCounts: FLAG_CONT_MOBILE 1 99.813 0 0.187 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [1 0] Median : 0.0, Variance: 0.20206872204588347, Describe: count 307511.000 mean 0.281 std 0.450 min 0.000 25% 0.000 50% 0.000 75% 1.000 max 1.000 Name: FLAG_PHONE, dtype: float64 ValueCounts: FLAG_PHONE 0 71.893 1 28.107 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0535029466256677, Describe: count 307511.000 mean 0.057 std 0.231 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_EMAIL, dtype: 
float64 ValueCounts: FLAG_EMAIL 0 94.328 1 5.672 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 96391, Mode: Laborers Unique: ['Laborers' 'Core staff' 'Accountants' 'Managers' nan 'Drivers' 'Sales staff' 'Cleaning staff' 'Cooking staff' 'Private service staff' 'Medicine staff' 'Security staff' 'High skill tech staff' 'Waiters/barmen staff' 'Low-skill Laborers' 'Realty agents' 'Secretaries' 'IT staff' 'HR staff'] ValueCounts: OCCUPATION_TYPE Laborers 26.140 Sales staff 15.206 Core staff 13.059 Managers 10.123 Drivers 8.812 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 2, Mode: 2.0 Unique: [ 1. 2. 3. 4. 5. 6. 9. 7. 8. 10. 13. nan 14. 12. 20. 15. 16. 11.] Median : 2.0, Variance: 0.8293409204392903, Describe: count 307509.000 mean 2.153 std 0.911 min 1.000 25% 2.000 50% 2.000 75% 3.000 max 20.000 Name: CNT_FAM_MEMBERS, dtype: float64 ValueCounts: CNT_FAM_MEMBERS 2.000 51.497 1.000 22.063 3.000 17.106 4.000 8.031 5.000 1.131 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 2 Unique: [2 1 3] Median : 2.0, Variance: 0.2591155142160186, Describe: count 307511.000 mean 2.052 std 0.509 min 1.000 25% 2.000 50% 2.000 75% 2.000 max 3.000 Name: REGION_RATING_CLIENT, dtype: float64 ValueCounts: REGION_RATING_CLIENT 2 73.813 3 15.717 1 10.470 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 2 Unique: [2 1 3] Median : 2.0, Variance: 0.2527445242643602, Describe: count 307511.000 mean 2.032 std 0.503 min 1.000 25% 2.000 50% 2.000 75% 2.000 max 3.000 Name: REGION_RATING_CLIENT_W_CITY, dtype: float64 ValueCounts: REGION_RATING_CLIENT_W_CITY 2 74.626 3 14.263 1 11.111 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: TUESDAY Unique: 
['WEDNESDAY' 'MONDAY' 'THURSDAY' 'SUNDAY' 'SATURDAY' 'FRIDAY' 'TUESDAY'] ValueCounts: WEEKDAY_APPR_PROCESS_START TUESDAY 17.528 WEDNESDAY 16.889 MONDAY 16.492 THURSDAY 16.452 FRIDAY 16.369 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 10 Unique: [10 11 9 17 16 14 8 15 7 13 6 12 19 3 18 21 4 5 20 22 1 2 23 0] Median : 12.0, Variance: 10.665660320665818, Describe: count 307511.000 mean 12.063 std 3.266 min 0.000 25% 10.000 50% 12.000 75% 14.000 max 23.000 Name: HOUR_APPR_PROCESS_START, dtype: float64 ValueCounts: HOUR_APPR_PROCESS_START 10 12.267 11 12.107 12 11.132 13 10.068 14 9.002 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.014914876209068216, Describe: count 307511.000 mean 0.015 std 0.122 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_REGION_NOT_LIVE_REGION, dtype: float64 ValueCounts: REG_REGION_NOT_LIVE_REGION 0 98.486 1 1.514 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.04819158950907743, Describe: count 307511.000 mean 0.051 std 0.220 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_REGION_NOT_WORK_REGION, dtype: float64 ValueCounts: REG_REGION_NOT_WORK_REGION 0 94.923 1 5.077 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.039005704439295116, Describe: count 307511.000 mean 0.041 std 0.197 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: LIVE_REGION_NOT_WORK_REGION, dtype: float64 ValueCounts: LIVE_REGION_NOT_WORK_REGION 0 95.934 1 4.066 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 
0.07206205892600395, Describe: count 307511.000 mean 0.078 std 0.268 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_CITY_NOT_LIVE_CITY, dtype: float64 ValueCounts: REG_CITY_NOT_LIVE_CITY 0 92.183 1 7.817 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.17734528517578357, Describe: count 307511.000 mean 0.230 std 0.421 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_CITY_NOT_WORK_CITY, dtype: float64 ValueCounts: REG_CITY_NOT_WORK_CITY 0 76.955 1 23.045 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.14731519424670028, Describe: count 307511.000 mean 0.180 std 0.384 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: LIVE_CITY_NOT_WORK_CITY, dtype: float64 ValueCounts: LIVE_CITY_NOT_WORK_CITY 0 82.045 1 17.955 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Business Entity Type 3 ValueCounts: ORGANIZATION_TYPE Business Entity Type 3 22.110 XNA 18.007 Self-employed 12.491 Other 5.425 Medicine 3.640 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 660, Mode: 0.2858978721410488 Median : 0.5659614260608526, Variance: 0.0365039828228688, Describe: count 306851.000 mean 0.514 std 0.191 min 0.000 25% 0.392 50% 0.566 75% 0.664 max 0.855 Name: EXT_SOURCE_2, dtype: float64 ValueCounts: EXT_SOURCE_2 0.286 0.235 0.262 0.136 0.265 0.112 0.160 0.105 0.265 0.100 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 60965, Mode: 0.746300213050371 Median : 0.5352762504724826, Variance: 0.03796432636328682, Describe: count 246546.000 mean 0.511 std 0.195 min 0.001 25% 0.371 50% 0.535 75% 0.669 max 0.896 Name: EXT_SOURCE_3, dtype: 
float64 ValueCounts: EXT_SOURCE_3 0.746 0.592 0.714 0.533 0.694 0.518 0.671 0.483 0.653 0.468 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 150007, Mode: 0.9871 Median : 0.9816, Variance: 0.0035074009635889924, Describe: count 157504.000 mean 0.978 std 0.059 min 0.000 25% 0.977 50% 0.982 75% 0.987 max 1.000 Name: YEARS_BEGINEXPLUATATION_AVG, dtype: float64 ValueCounts: YEARS_BEGINEXPLUATATION_AVG 0.987 2.737 0.986 2.660 0.986 2.648 0.980 2.618 0.987 2.612 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 153020, Mode: 0.1667 Median : 0.1667, Variance: 0.020920931965718682, Describe: count 154491.000 mean 0.226 std 0.145 min 0.000 25% 0.167 50% 0.167 75% 0.333 max 1.000 Name: FLOORSMAX_AVG, dtype: float64 ValueCounts: FLOORSMAX_AVG 0.167 40.051 0.333 20.654 0.042 9.450 0.375 5.130 0.125 4.514 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 150007, Mode: 0.9871 Median : 0.9816, Variance: 0.004169987074137329, Describe: count 157504.000 mean 0.977 std 0.065 min 0.000 25% 0.977 50% 0.982 75% 0.987 max 1.000 Name: YEARS_BEGINEXPLUATATION_MODE, dtype: float64 ValueCounts: YEARS_BEGINEXPLUATATION_MODE 0.987 2.724 0.987 2.649 0.986 2.646 0.980 2.609 0.981 2.592 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 153020, Mode: 0.1667 Unique: [0.0833 0.2917 nan 0.1667 0.3333 0.6667 0.375 0.0417 0.25 0.4583 0.2083 0.125 0. 0.5833 0.625 0.9167 0.9583 0.5417 1. 
0.4167 0.875 0.7083 0.75 0.5 0.7917 0.8333] Median : 0.1667, Variance: 0.020652393543979256, Describe: count 154491.000 mean 0.222 std 0.144 min 0.000 25% 0.167 50% 0.167 75% 0.333 max 1.000 Name: FLOORSMAX_MODE, dtype: float64 ValueCounts: FLOORSMAX_MODE 0.167 42.430 0.333 22.249 0.042 10.108 0.375 5.386 0.125 4.704 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 150007, Mode: 0.9871 Median : 0.9816, Variance: 0.003587688764103631, Describe: count 157504.000 mean 0.978 std 0.060 min 0.000 25% 0.977 50% 0.982 75% 0.987 max 1.000 Name: YEARS_BEGINEXPLUATATION_MEDI, dtype: float64 ValueCounts: YEARS_BEGINEXPLUATATION_MEDI 0.987 2.739 0.986 2.696 0.986 2.666 0.987 2.627 0.980 2.613 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 153020, Mode: 0.1667 Median : 0.1667, Variance: 0.02104444200909498, Describe: count 154491.000 mean 0.226 std 0.145 min 0.000 25% 0.167 50% 0.167 75% 0.333 max 1.000 Name: FLOORSMAX_MEDI, dtype: float64 ValueCounts: FLOORSMAX_MEDI 0.167 41.172 0.333 21.541 0.042 9.601 0.375 5.312 0.125 4.582 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 148431, Mode: 0.0 Median : 0.0688, Variance: 0.01154815111163752, Describe: count 159080.000 mean 0.103 std 0.107 min 0.000 25% 0.041 50% 0.069 75% 0.128 max 1.000 Name: TOTALAREA_MODE, dtype: float64 ValueCounts: TOTALAREA_MODE 0.000 0.366 0.057 0.155 0.055 0.145 0.055 0.143 0.056 0.143 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 145755, Mode: No Unique: ['No' nan 'Yes'] ValueCounts: EMERGENCYSTATE_MODE No 98.561 Yes 1.439 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 1021, Mode: 0.0 Unique: [ 2. 1. 0. 4. 8. 10. nan 7. 3. 6. 5. 12. 9. 13. 11. 14. 22. 16. 15. 17. 20. 25. 19. 18. 21. 
24. 23. 28. 26. 29. 27. 47. 348. 30.] Median : 0.0, Variance: 5.764746958955638, Describe: count 306490.000 mean 1.422 std 2.401 min 0.000 25% 0.000 50% 0.000 75% 2.000 max 348.000 Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: OBS_30_CNT_SOCIAL_CIRCLE 0.000 53.480 1.000 15.917 2.000 9.726 3.000 6.631 4.000 4.615 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 1021, Mode: 0.0 Unique: [ 2. 0. 1. nan 3. 4. 5. 6. 7. 34. 8.] Median : 0.0, Variance: 0.19953948681282563, Describe: count 306490.000 mean 0.143 std 0.447 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 34.000 Name: DEF_30_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: DEF_30_CNT_SOCIAL_CIRCLE 0.000 88.526 1.000 9.243 2.000 1.737 3.000 0.389 4.000 0.083 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 1021, Mode: 0.0 Unique: [ 2. 1. 0. 4. 8. 10. nan 7. 3. 6. 5. 12. 9. 13. 11. 14. 21. 15. 22. 16. 20. 25. 17. 19. 18. 24. 23. 28. 29. 27. 47. 344. 30. 26.] Median : 0.0, Variance: 5.663463994080467, Describe: count 306490.000 mean 1.405 std 2.380 min 0.000 25% 0.000 50% 0.000 75% 2.000 max 344.000 Name: OBS_60_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: OBS_60_CNT_SOCIAL_CIRCLE 0.000 53.726 1.000 15.945 2.000 9.712 3.000 6.596 4.000 4.550 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 1021, Mode: 0.0 Unique: [ 2. 0. 1. nan 3. 5. 4. 7. 24. 6.] 
Median : 0.0, Variance: 0.131254626645024, Describe: count 306490.000 mean 0.100 std 0.362 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 24.000 Name: DEF_60_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: DEF_60_CNT_SOCIAL_CIRCLE 0.000 91.592 1.000 7.126 2.000 1.034 3.000 0.195 4.000 0.044 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 1, Mode: 0.0 Median : -757.0, Variance: 683612.2742421023, Describe: count 307510.000 mean -962.859 std 826.808 min -4292.000 25% -1570.000 50% -757.000 75% -274.000 max 0.000 Name: DAYS_LAST_PHONE_CHANGE, dtype: float64 ValueCounts: DAYS_LAST_PHONE_CHANGE 0.000 12.251 -1.000 0.914 -2.000 0.754 -3.000 0.573 -4.000 0.418 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 4.227326079183737e-05, Describe: count 307511.000 mean 0.000 std 0.007 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_2, dtype: float64 ValueCounts: FLAG_DOCUMENT_2 0 99.996 1 0.004 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.20589084885614972, Describe: count 307511.000 mean 0.710 std 0.454 min 0.000 25% 0.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_DOCUMENT_3, dtype: float64 ValueCounts: FLAG_DOCUMENT_3 1 71.002 0 28.998 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 8.129155979353892e-05, Describe: count 307511.000 mean 0.000 std 0.009 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_4, dtype: float64 ValueCounts: FLAG_DOCUMENT_4 0 99.992 1 0.008 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 
0.014886494471071949, Describe: count 307511.000 mean 0.015 std 0.122 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_5, dtype: float64 ValueCounts: FLAG_DOCUMENT_5 0 98.489 1 1.511 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.08030189665614886, Describe: count 307511.000 mean 0.088 std 0.283 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_6, dtype: float64 ValueCounts: FLAG_DOCUMENT_6 0 91.194 1 8.806 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00019182686767647077, Describe: count 307511.000 mean 0.000 std 0.014 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_7, dtype: float64 ValueCounts: FLAG_DOCUMENT_7 0 99.981 1 0.019 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.07475414850818889, Describe: count 307511.000 mean 0.081 std 0.273 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_8, dtype: float64 ValueCounts: FLAG_DOCUMENT_8 0 91.862 1 8.138 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0038806309936899315, Describe: count 307511.000 mean 0.004 std 0.062 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_9, dtype: float64 ValueCounts: FLAG_DOCUMENT_9 0 99.610 1 0.390 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 2.276296919164723e-05, Describe: count 307511.000 mean 0.000 std 0.005 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_10, dtype: float64 
ValueCounts: FLAG_DOCUMENT_10 0 99.998 1 0.002 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0038967636747465713, Describe: count 307511.000 mean 0.004 std 0.062 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_11, dtype: float64 ValueCounts: FLAG_DOCUMENT_11 0 99.609 1 0.391 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 6.5038112332850705e-06, Describe: count 307511.000 mean 0.000 std 0.003 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_12, dtype: float64 ValueCounts: FLAG_DOCUMENT_12 0 99.999 1 0.001 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.003512662405688219, Describe: count 307511.000 mean 0.004 std 0.059 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_13, dtype: float64 ValueCounts: FLAG_DOCUMENT_13 0 99.647 1 0.353 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0029278669255157, Describe: count 307511.000 mean 0.003 std 0.054 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_14, dtype: float64 ValueCounts: FLAG_DOCUMENT_14 0 99.706 1 0.294 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0012082533473030956, Describe: count 307511.000 mean 0.001 std 0.035 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_15, dtype: float64 ValueCounts: FLAG_DOCUMENT_15 0 99.879 1 0.121 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total 
Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.009829564925667255, Describe: count 307511.000 mean 0.010 std 0.099 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_16, dtype: float64 ValueCounts: FLAG_DOCUMENT_16 0 99.007 1 0.993 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00026658688860710627, Describe: count 307511.000 mean 0.000 std 0.016 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_17, dtype: float64 ValueCounts: FLAG_DOCUMENT_17 0 99.973 1 0.027 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00806372320835816, Describe: count 307511.000 mean 0.008 std 0.090 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_18, dtype: float64 ValueCounts: FLAG_DOCUMENT_18 0 99.187 1 0.813 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0005947484523376042, Describe: count 307511.000 mean 0.001 std 0.024 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_19, dtype: float64 ValueCounts: FLAG_DOCUMENT_19 0 99.940 1 0.060 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0005070432225544667, Describe: count 307511.000 mean 0.001 std 0.023 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_20, dtype: float64 ValueCounts: FLAG_DOCUMENT_20 0 99.949 1 0.051 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00033483626685604196, Describe: count 307511.000 mean 0.000 std 0.018 min 0.000 25% 0.000 50% 
0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_21, dtype: float64 ValueCounts: FLAG_DOCUMENT_21 0 99.967 1 0.033 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 41519, Mode: 0.0 Unique: [ 0. nan 1. 2. 3. 4.] Median : 0.0, Variance: 0.007030676341450841, Describe: count 265992.000 mean 0.006 std 0.084 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 4.000 Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_HOUR 0.000 99.389 1.000 0.586 2.000 0.021 3.000 0.003 4.000 0.000 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 41519, Mode: 0.0 Unique: [ 0. nan 1. 3. 2. 4. 5. 6. 9. 8.] Median : 0.0, Variance: 0.012267203055661324, Describe: count 265992.000 mean 0.007 std 0.111 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 9.000 Name: AMT_REQ_CREDIT_BUREAU_DAY, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_DAY 0.000 99.440 1.000 0.486 2.000 0.040 3.000 0.017 4.000 0.010 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 41519, Mode: 0.0 Unique: [ 0. nan 1. 3. 2. 4. 5. 6. 8. 7.] Median : 0.0, Variance: 0.04189589838652147, Describe: count 265992.000 mean 0.034 std 0.205 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 8.000 Name: AMT_REQ_CREDIT_BUREAU_WEEK, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_WEEK 0.000 96.791 1.000 3.086 2.000 0.075 3.000 0.022 4.000 0.013 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 41519, Mode: 0.0 Unique: [ 0. nan 1. 2. 6. 5. 3. 7. 9. 4. 11. 8. 16. 12. 14. 10. 13. 17. 24. 19. 15. 23. 18. 27. 22.] 
Median : 0.0, Variance: 0.8390603897599506, Describe: count 265992.000 mean 0.267 std 0.916 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 27.000 Name: AMT_REQ_CREDIT_BUREAU_MON, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_MON 0.000 83.549 1.000 12.462 2.000 2.025 3.000 0.749 4.000 0.405 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 41519, Mode: 0.0 Unique: [ 0. nan 1. 2. 4. 3. 8. 5. 6. 7. 261. 19.] Median : 0.0, Variance: 0.6305243726283989, Describe: count 265992.000 mean 0.265 std 0.794 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 261.000 Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_QRT 0.000 80.986 1.000 12.730 2.000 5.418 3.000 0.646 4.000 0.179 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 41519, Mode: 0.0 Unique: [ 1. 0. nan 2. 4. 5. 3. 8. 6. 9. 7. 10. 11. 13. 16. 12. 25. 23. 15. 14. 22. 17. 19. 18. 21. 20.] Median : 1.0, Variance: 3.4942637902290734, Describe: count 265992.000 mean 1.900 std 1.869 min 0.000 25% 0.000 50% 1.000 75% 3.000 max 25.000 Name: AMT_REQ_CREDIT_BUREAU_YEAR, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_YEAR 0.000 26.994 1.000 23.837 2.000 18.870 3.000 12.642 4.000 7.787 Name: proportion, dtype: float64 ------------------------------------------------------------------
DAYS_COLS = curr_appl_data1.filter(like="DAYS").columns # select all columns whose name contains DAYS
FLAG_COLS = curr_appl_data1.filter(like="FLAG").columns # select all columns whose name contains FLAG
FLOORSMAX_COLS = curr_appl_data1.columns[curr_appl_data1.columns.str.contains("FLOORSMAX")] # select all columns whose name contains FLOORSMAX
BEGINEXPLUATATION_COLS = curr_appl_data1.columns[curr_appl_data1.columns.str.contains("BEGINEXPLUATATION")] # select all columns whose name contains BEGINEXPLUATATION
AMT_REQ_CREDIT_BUREAU_COLS = curr_appl_data1.columns[curr_appl_data1.columns.str.contains("AMT_REQ_CREDIT_BUREAU")] # select all columns whose name contains AMT_REQ_CREDIT_BUREAU
# curr_appl_data1["FLOORSMAX"] = curr_appl_data1[FLOORSMAX_COLS].sum(axis=1, skipna=False) / len(FLOORSMAX_COLS)
# curr_appl_data1["YEARS_BEGINEXPLOITATION"] = curr_appl_data1.loc[:, BEGINEXPLUATATION_COLS].sum(axis=1, skipna=False) / len(BEGINEXPLUATATION_COLS)
curr_appl_data1["FLOORSMAX"] = curr_appl_data1[FLOORSMAX_COLS].mean(axis=1, skipna=False)
curr_appl_data1["YEARS_BEGINEXPLOITATION"] = curr_appl_data1.loc[:, BEGINEXPLUATATION_COLS].mean(axis=1, skipna=False)
- The above features are generally considered additional features.
- To reduce the number of columns, we combined similar columns into a single feature column (the mean was chosen among min/max/mean/median through an iterative approach).
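The effect of `skipna=False` in the row-wise mean is easy to miss, so here is a minimal sketch on a toy frame. The values are made up and the `toy`, `strict` and `lenient` names are illustrative only. With `skipna=False`, a combined value is produced only when every variant column is present for that row.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three FLOORSMAX_* variants (values are invented).
toy = pd.DataFrame({
    "FLOORSMAX_AVG":  [0.10, 0.20, np.nan],
    "FLOORSMAX_MODE": [0.20, np.nan, np.nan],
    "FLOORSMAX_MEDI": [0.30, 0.40, np.nan],
})

cols = toy.filter(like="FLOORSMAX").columns

# skipna=False: the row mean is NaN as soon as any variant is missing.
strict = toy[cols].mean(axis=1, skipna=False)

# skipna=True (the pandas default): average whichever variants are present.
lenient = toy[cols].mean(axis=1, skipna=True)

print(pd.DataFrame({"strict": strict, "lenient": lenient}))
```

In this sketch the second row is NaN under the strict rule but gets a value under the lenient one, so the choice changes how many rows are left for the later median fill.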
for col in AMT_REQ_CREDIT_BUREAU_COLS:  # assign back instead of chained fillna(..., inplace=True)
    curr_appl_data1[col] = curr_appl_data1[col].fillna(curr_appl_data1[col].mode()[0])
curr_appl_data1["FLOORSMAX"] = curr_appl_data1["FLOORSMAX"].fillna( curr_appl_data1["FLOORSMAX"].median())
curr_appl_data1["EXT_SOURCE_2"] = curr_appl_data1["EXT_SOURCE_2"].fillna( curr_appl_data1["EXT_SOURCE_2"].median())
curr_appl_data1["EXT_SOURCE_3"] = curr_appl_data1["EXT_SOURCE_3"].fillna( curr_appl_data1["EXT_SOURCE_3"].median())
curr_appl_data1["TOTALAREA_MODE"] = curr_appl_data1["TOTALAREA_MODE"].fillna( curr_appl_data1["TOTALAREA_MODE"].mean())
curr_appl_data1["NAME_TYPE_SUITE"] = curr_appl_data1["NAME_TYPE_SUITE"].fillna( curr_appl_data1["NAME_TYPE_SUITE"].mode()[0])
curr_appl_data1["CNT_FAM_MEMBERS"] = curr_appl_data1["CNT_FAM_MEMBERS"].fillna( curr_appl_data1["CNT_FAM_MEMBERS"].mode()[0])
curr_appl_data1["EMERGENCYSTATE_MODE"] = curr_appl_data1["EMERGENCYSTATE_MODE"].fillna( curr_appl_data1["EMERGENCYSTATE_MODE"].mode()[0])
curr_appl_data1["DAYS_LAST_PHONE_CHANGE"] = curr_appl_data1["DAYS_LAST_PHONE_CHANGE"].fillna(curr_appl_data1["DAYS_LAST_PHONE_CHANGE"].median())
curr_appl_data1["YEARS_BEGINEXPLOITATION"] = curr_appl_data1["YEARS_BEGINEXPLOITATION"].fillna(curr_appl_data1["YEARS_BEGINEXPLOITATION"].mean())
curr_appl_data1["OBS_30_CNT_SOCIAL_CIRCLE"] = curr_appl_data1["OBS_30_CNT_SOCIAL_CIRCLE"].fillna(curr_appl_data1["OBS_30_CNT_SOCIAL_CIRCLE"].mode()[0])
curr_appl_data1["OBS_60_CNT_SOCIAL_CIRCLE"] = curr_appl_data1["OBS_60_CNT_SOCIAL_CIRCLE"].fillna(curr_appl_data1["OBS_60_CNT_SOCIAL_CIRCLE"].mode()[0])
curr_appl_data1["DEF_30_CNT_SOCIAL_CIRCLE"] = curr_appl_data1["DEF_30_CNT_SOCIAL_CIRCLE"].fillna(curr_appl_data1["DEF_30_CNT_SOCIAL_CIRCLE"].mode()[0])
curr_appl_data1["DEF_60_CNT_SOCIAL_CIRCLE"] = curr_appl_data1["DEF_60_CNT_SOCIAL_CIRCLE"].fillna(curr_appl_data1["DEF_60_CNT_SOCIAL_CIRCLE"].mode()[0])
# curr_appl_data1 = curr_appl_data1[~curr_appl_data1[["AMT_ANNUITY", "AMT_GOODS_PRICE"]].isna().any(axis=1)]
curr_appl_data1["AMT_ANNUITY"] = curr_appl_data1["AMT_ANNUITY"].fillna( curr_appl_data1["AMT_ANNUITY"].mean())
curr_appl_data1["AMT_GOODS_PRICE"] = curr_appl_data1["AMT_GOODS_PRICE"].fillna( curr_appl_data1["AMT_GOODS_PRICE"].mean())
- Since many columns contain null values, we imputed them based on the statistics of each specific column.
- For categorical/discrete columns, we used the mode if the most frequent value accounts for more than 50% of the records.
- For numerical columns, we used the mean or the median, depending on distribution and skewness.
- AMT_ANNUITY and AMT_GOODS_PRICE are considered critical features.
- To impute such features properly, we would need to identify whether the values are missing at random (MAR), missing not at random (MNAR) or missing completely at random (MCAR), and analyse the dependent columns and their distributions to fine-tune the imputation.
- For this analysis, since only a small number of values are missing (a little over 200), we considered their impact on the outcome of the analysis to be low.
- To keep the dataset intact, we therefore filled the null values with the column mean instead of dropping the rows.
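The selection rule described in the notes above can be sketched as a small helper. This is an illustration, not the code used in this analysis: `choose_fill_value` is a hypothetical name, the skewness cut-off of 1.0 is an assumed threshold, and falling back to the mode for every non-numeric column is also an assumption.

```python
import pandas as pd

def choose_fill_value(series: pd.Series, mode_share: float = 0.5, skew_limit: float = 1.0):
    """Hypothetical helper: pick a fill value for one column.

    - mode   if the most frequent value covers more than `mode_share` of the
             non-null records (or the column is not numeric; assumption)
    - median if the numeric column is strongly skewed (|skew| > `skew_limit`)
    - mean   otherwise
    """
    non_null = series.dropna()
    top_share = non_null.value_counts(normalize=True).iloc[0]
    if not pd.api.types.is_numeric_dtype(non_null) or top_share > mode_share:
        return non_null.mode().iloc[0]
    if abs(non_null.skew()) > skew_limit:
        return non_null.median()
    return non_null.mean()

# Toy columns (invented values) covering the three branches.
dominant = pd.Series([0, 0, 0, 1, None])                 # top value covers 75% -> mode
skewed = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, None])    # strongly right-skewed -> median
mild = pd.Series([1.0, 2.0, 3.0, 4.0, 6.0, None])        # mildly skewed -> mean

for name, col in {"dominant": dominant, "skewed": skewed, "mild": mild}.items():
    print(name, choose_fill_value(col))
```

A helper like this only automates the first pass. The MAR/MNAR/MCAR question raised above still needs a look at how missingness relates to other columns.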
curr_appl_data1 = curr_appl_data1.drop(FLOORSMAX_COLS, axis=1)
curr_appl_data1 = curr_appl_data1.drop(BEGINEXPLUATATION_COLS, axis=1)
- The columns in FLOORSMAX_COLS and BEGINEXPLUATATION_COLS have been feature-engineered into single columns and are therefore redundant, so we dropped them.
# curr_appl_data1["EMERGENCYSTATE_MODE"] = curr_appl_data1["EMERGENCYSTATE_MODE"].replace(to_replace=["Yes", "No"], value=[1, 0])
curr_appl_data1["FLAG_OWN_CAR"] = curr_appl_data1["FLAG_OWN_CAR"].replace(to_replace=["Y", "N"], value=["Yes", "No"])
curr_appl_data1["FLAG_OWN_REALTY"] = curr_appl_data1["FLAG_OWN_REALTY"].replace(to_replace=["Y", "N"], value=["Yes", "No"])
curr_appl_data1["NAME_TYPE_SUITE"] = curr_appl_data1["NAME_TYPE_SUITE"].replace(to_replace="Spouse, partner", value="Spouse")
curr_appl_data1["NAME_TYPE_SUITE"] = curr_appl_data1["NAME_TYPE_SUITE"].replace(to_replace=["Other_A", "Other_B"], value=["Others", "Others"])
curr_appl_data1["NAME_HOUSING_TYPE"] = curr_appl_data1["NAME_HOUSING_TYPE"].replace(to_replace="House / apartment", value="House")
curr_appl_data1["NAME_FAMILY_STATUS"] = curr_appl_data1["NAME_FAMILY_STATUS"].replace(to_replace="Single / not married", value="Single")
curr_appl_data1["NAME_EDUCATION_TYPE"] = curr_appl_data1["NAME_EDUCATION_TYPE"].replace(to_replace="Secondary / secondary special", value="Secondary")
- Some categorical column values are overly detailed, so we replaced them with more apt and common terms.
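The relabelling above can also be expressed as a single mapping per column, which keeps all substitutions in one place. The following is a minimal sketch on a toy frame. The `RELABEL` dictionary and the `toy` and `relabelled` names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy rows using category labels that appear in the replacements above.
toy = pd.DataFrame({
    "NAME_TYPE_SUITE": ["Spouse, partner", "Other_A", "Other_B", "Family"],
    "NAME_HOUSING_TYPE": ["House / apartment", "With parents",
                          "House / apartment", "Rented apartment"],
})

# Hypothetical mapping: {column: {old label: new label}}.
RELABEL = {
    "NAME_TYPE_SUITE": {"Spouse, partner": "Spouse",
                        "Other_A": "Others", "Other_B": "Others"},
    "NAME_HOUSING_TYPE": {"House / apartment": "House"},
}

# DataFrame.replace accepts a nested dict and applies each inner mapping
# only to the column it is keyed by; unmapped labels are left untouched.
relabelled = toy.replace(RELABEL)
print(relabelled)
```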
curr_appl_data1[DAYS_COLS] = (abs(curr_appl_data1[DAYS_COLS]) / 365).astype(int)
- DAYS_COLS is the list of day-type columns, whose values are expressed as day counts.
- There were inconsistencies in the values (for example, negative day counts), so we made all values absolute and then converted them to whole years by dividing by 365.
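A minimal sketch of this conversion on toy values follows (the numbers are invented for illustration). It shows two side effects worth keeping in mind: `astype(int)` truncates to whole years instead of rounding, and taking the absolute value does not correct an implausibly large day count, it simply turns it into an implausibly large year count.

```python
import pandas as pd

# Toy day counts: negative values count backwards from the application date;
# the large positive value is included only to show what abs() does with it.
toy = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765],
    "DAYS_EMPLOYED": [-637, 365243],
})

days_cols = toy.filter(like="DAYS").columns

# Same transformation as above: absolute value, divide by 365, truncate.
years = (toy[days_cols].abs() / 365).astype(int)
print(years)
```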
days_to_years = {i: i.replace("DAYS", "YEARS") for i in DAYS_COLS}
curr_appl_data1 = curr_appl_data1.rename(columns=days_to_years)
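# Note: rename() returns a new DataFrame and the result is not assigned here, so this only displays a preview; curr_appl_data1 itself keeps the YEARS_BIRTH name.
curr_appl_data1.rename(columns={'YEARS_BIRTH': 'AGE_IN_YEARS'})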
YEARS_COLS = curr_appl_data1.filter(like="YEARS").columns
| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | AGE_IN_YEARS | YEARS_EMPLOYED | YEARS_REGISTRATION | YEARS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | TOTALAREA_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | YEARS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | FLOORSMAX | YEARS_BEGINEXPLOITATION |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | No | Yes | 0 | 202500.000 | 406597.500 | 24700.500 | 351000.000 | Unaccompanied | Working | Secondary | Single | House | 0.019 | 25 | 1 | 9 | 5 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.000 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.263 | 0.139 | 0.015 | No | 2.000 | 2.000 | 2.000 | 2.000 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.083 | 0.972 |
1 | 100003 | 0 | Cash loans | F | No | No | 0 | 270000.000 | 1293502.500 | 35698.500 | 1129500.000 | Family | State servant | Higher education | Married | House | 0.004 | 45 | 3 | 3 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.000 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622 | 0.535 | 0.071 | No | 1.000 | 0.000 | 1.000 | 0.000 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.292 | 0.985 |
2 | 100004 | 0 | Revolving loans | M | Yes | Yes | 0 | 67500.000 | 135000.000 | 6750.000 | 135000.000 | Unaccompanied | Working | Secondary | Single | House | 0.010 | 52 | 0 | 11 | 6 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.000 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.556 | 0.730 | 0.103 | No | 0.000 | 0.000 | 0.000 | 0.000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.167 | 0.978 |
3 | 100006 | 0 | Cash loans | F | No | Yes | 0 | 135000.000 | 312682.500 | 29686.500 | 297000.000 | Unaccompanied | Working | Secondary | Civil marriage | House | 0.008 | 52 | 8 | 26 | 6 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.000 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.650 | 0.535 | 0.103 | No | 2.000 | 0.000 | 2.000 | 0.000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.167 | 0.978 |
4 | 100007 | 0 | Cash loans | M | No | Yes | 0 | 121500.000 | 513000.000 | 21865.500 | 513000.000 | Unaccompanied | Working | Secondary | Single | House | 0.029 | 54 | 8 | 11 | 9 | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.000 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.323 | 0.535 | 0.103 | No | 0.000 | 0.000 | 0.000 | 0.000 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.167 | 0.978 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
307506 | 456251 | 0 | Cash loans | M | No | No | 0 | 157500.000 | 254700.000 | 27558.000 | 225000.000 | Unaccompanied | Working | Secondary | Separated | With parents | 0.033 | 25 | 0 | 23 | 5 | 1 | 1 | 0 | 1 | 0 | 0 | Sales staff | 1.000 | 1 | 1 | THURSDAY | 15 | 0 | 0 | 0 | 0 | 0 | 0 | Services | 0.682 | 0.535 | 0.290 | No | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.556 | 0.984 |
307507 | 456252 | 0 | Cash loans | F | No | Yes | 0 | 72000.000 | 269550.000 | 12001.500 | 225000.000 | Unaccompanied | Pensioner | Secondary | Widow | House | 0.025 | 56 | 1000 | 12 | 11 | 1 | 0 | 0 | 1 | 1 | 0 | NaN | 1.000 | 2 | 2 | MONDAY | 8 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.116 | 0.535 | 0.021 | No | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.083 | 0.973 |
307508 | 456253 | 0 | Cash loans | F | No | Yes | 0 | 153000.000 | 677664.000 | 29979.000 | 585000.000 | Unaccompanied | Working | Higher education | Separated | House | 0.005 | 41 | 21 | 18 | 14 | 1 | 1 | 0 | 1 | 0 | 1 | Managers | 1.000 | 3 | 3 | THURSDAY | 9 | 0 | 0 | 0 | 0 | 1 | 1 | School | 0.536 | 0.219 | 0.797 | No | 6.000 | 0.000 | 6.000 | 0.000 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.167 | 0.982 |
307509 | 456254 | 1 | Cash loans | F | No | Yes | 0 | 171000.000 | 370107.000 | 20205.000 | 319500.000 | Unaccompanied | Commercial associate | Secondary | Married | House | 0.005 | 32 | 13 | 7 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.000 | 2 | 2 | WEDNESDAY | 9 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 1 | 0.514 | 0.661 | 0.009 | No | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.042 | 0.977 |
307510 | 456255 | 0 | Cash loans | F | No | No | 0 | 157500.000 | 675000.000 | 49117.500 | 675000.000 | Unaccompanied | Commercial associate | Higher education | Married | House | 0.046 | 46 | 3 | 14 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 2.000 | 1 | 1 | THURSDAY | 20 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.709 | 0.114 | 0.072 | No | 0.000 | 0.000 | 0.000 | 0.000 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 2.000 | 0.000 | 1.000 | 0.375 | 0.988 |
307511 rows × 77 columns
- Renamed the DAYS columns to YEARS, since the values have already been converted.
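One detail of the renaming step is worth illustrating: `DataFrame.rename` returns a new frame and leaves the original untouched unless the result is assigned back. The sketch below uses a toy frame with invented values. The `toy` and `preview` names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"DAYS_BIRTH": [25, 45], "DAYS_EMPLOYED": [1, 3]})

# Build the DAYS -> YEARS mapping and assign the renamed frame back.
days_to_years = {c: c.replace("DAYS", "YEARS") for c in toy.filter(like="DAYS").columns}
toy = toy.rename(columns=days_to_years)

# Without assignment, the rename only exists in the returned copy.
preview = toy.rename(columns={"YEARS_BIRTH": "AGE_IN_YEARS"})

print(list(toy.columns))      # the stored frame still has YEARS_BIRTH
print(list(preview.columns))  # the returned copy has AGE_IN_YEARS
```

This is consistent with the preview table showing AGE_IN_YEARS while the null-percentage listing further down still reports YEARS_BIRTH.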
curr_appl_na_pct1 = check_cols_null_pct(curr_appl_data1)
curr_appl_na_pct1
show_stats(curr_appl_data1, curr_appl_data1.columns)
OCCUPATION_TYPE 31.346 SK_ID_CURR 0.000 FLAG_DOCUMENT_2 0.000 FLAG_DOCUMENT_8 0.000 FLAG_DOCUMENT_7 0.000 FLAG_DOCUMENT_6 0.000 FLAG_DOCUMENT_5 0.000 FLAG_DOCUMENT_4 0.000 FLAG_DOCUMENT_3 0.000 YEARS_LAST_PHONE_CHANGE 0.000 FLAG_DOCUMENT_10 0.000 DEF_60_CNT_SOCIAL_CIRCLE 0.000 OBS_60_CNT_SOCIAL_CIRCLE 0.000 DEF_30_CNT_SOCIAL_CIRCLE 0.000 OBS_30_CNT_SOCIAL_CIRCLE 0.000 EMERGENCYSTATE_MODE 0.000 TOTALAREA_MODE 0.000 EXT_SOURCE_3 0.000 FLAG_DOCUMENT_9 0.000 FLAG_DOCUMENT_11 0.000 ORGANIZATION_TYPE 0.000 FLAG_DOCUMENT_21 0.000 FLOORSMAX 0.000 AMT_REQ_CREDIT_BUREAU_YEAR 0.000 AMT_REQ_CREDIT_BUREAU_QRT 0.000 AMT_REQ_CREDIT_BUREAU_MON 0.000 AMT_REQ_CREDIT_BUREAU_WEEK 0.000 AMT_REQ_CREDIT_BUREAU_DAY 0.000 AMT_REQ_CREDIT_BUREAU_HOUR 0.000 FLAG_DOCUMENT_20 0.000 FLAG_DOCUMENT_12 0.000 FLAG_DOCUMENT_19 0.000 FLAG_DOCUMENT_18 0.000 FLAG_DOCUMENT_17 0.000 FLAG_DOCUMENT_16 0.000 FLAG_DOCUMENT_15 0.000 FLAG_DOCUMENT_14 0.000 FLAG_DOCUMENT_13 0.000 EXT_SOURCE_2 0.000 LIVE_CITY_NOT_WORK_CITY 0.000 TARGET 0.000 YEARS_EMPLOYED 0.000 REGION_POPULATION_RELATIVE 0.000 NAME_HOUSING_TYPE 0.000 NAME_FAMILY_STATUS 0.000 NAME_EDUCATION_TYPE 0.000 NAME_INCOME_TYPE 0.000 NAME_TYPE_SUITE 0.000 AMT_GOODS_PRICE 0.000 AMT_ANNUITY 0.000 AMT_CREDIT 0.000 AMT_INCOME_TOTAL 0.000 CNT_CHILDREN 0.000 FLAG_OWN_REALTY 0.000 FLAG_OWN_CAR 0.000 CODE_GENDER 0.000 NAME_CONTRACT_TYPE 0.000 YEARS_BIRTH 0.000 YEARS_REGISTRATION 0.000 REG_CITY_NOT_WORK_CITY 0.000 YEARS_ID_PUBLISH 0.000 REG_CITY_NOT_LIVE_CITY 0.000 LIVE_REGION_NOT_WORK_REGION 0.000 REG_REGION_NOT_WORK_REGION 0.000 REG_REGION_NOT_LIVE_REGION 0.000 HOUR_APPR_PROCESS_START 0.000 WEEKDAY_APPR_PROCESS_START 0.000 REGION_RATING_CLIENT_W_CITY 0.000 REGION_RATING_CLIENT 0.000 CNT_FAM_MEMBERS 0.000 FLAG_EMAIL 0.000 FLAG_PHONE 0.000 FLAG_CONT_MOBILE 0.000 FLAG_WORK_PHONE 0.000 FLAG_EMP_PHONE 0.000 FLAG_MOBIL 0.000 YEARS_BEGINEXPLOITATION 0.000 dtype: float64
Total Nulls: 0, Mode: 100002 Median : 278202.0, Variance: 10565820148.159698, Describe: count 307511.000 mean 278180.519 std 102790.175 min 100002.000 25% 189145.500 50% 278202.000 75% 367142.500 max 456255.000 Name: SK_ID_CURR, dtype: float64 ValueCounts: SK_ID_CURR 100002 0.000 337664 0.000 337661 0.000 337660 0.000 337659 0.000 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [1 0] Median : 0.0, Variance: 0.07421191849651897, Describe: count 307511.000 mean 0.081 std 0.272 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: TARGET, dtype: float64 ValueCounts: TARGET 0 91.927 1 8.073 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash loans Unique: ['Cash loans' 'Revolving loans'] ValueCounts: NAME_CONTRACT_TYPE Cash loans 90.479 Revolving loans 9.521 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: F Unique: ['M' 'F' 'XNA'] ValueCounts: CODE_GENDER F 65.834 M 34.164 XNA 0.001 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: No Unique: ['No' 'Yes'] ValueCounts: FLAG_OWN_CAR No 65.989 Yes 34.011 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Yes Unique: ['Yes' 'No'] ValueCounts: FLAG_OWN_REALTY Yes 69.367 No 30.633 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [ 0 1 2 3 4 7 5 6 8 9 11 12 10 19 14] Median : 0.0, Variance: 0.5214592938640343, Describe: count 307511.000 mean 0.417 std 0.722 min 0.000 25% 0.000 50% 0.000 75% 1.000 max 19.000 Name: CNT_CHILDREN, dtype: float64 ValueCounts: CNT_CHILDREN 0 70.037 1 19.875 2 8.699 3 1.209 4 0.140 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 135000.0 Median : 147150.0, Variance: 56227386501.174484, Describe: count 307511.000 mean 168797.919 std 237123.146 min 25650.000 25% 112500.000 50% 147150.000 75% 202500.000 max 117000000.000 Name: AMT_INCOME_TOTAL, dtype: float64 ValueCounts: AMT_INCOME_TOTAL 135000.000 11.626 112500.000 10.087 157500.000 8.636 180000.000 8.038 90000.000 7.311 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 450000.0 Median : 513531.0, Variance: 161998825566.80032, Describe: count 307511.000 mean 599026.000 std 402490.777 min 45000.000 25% 270000.000 50% 513531.000 75% 808650.000 max 4050000.000 Name: AMT_CREDIT, dtype: float64 ValueCounts: AMT_CREDIT 450000.000 3.157 675000.000 2.887 225000.000 2.654 180000.000 2.388 270000.000 2.355 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 9000.0 Median : 24903.0, Variance: 210060223.83415946, Describe: count 307511.000 mean 27108.574 std 14493.455 min 1615.500 25% 16524.000 50% 24903.000 75% 34596.000 max 258025.500 Name: AMT_ANNUITY, dtype: float64 ValueCounts: AMT_ANNUITY 9000.000 2.076 13500.000 1.793 6750.000 0.741 10125.000 0.662 37800.000 0.521 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 450000.0 Median : 450000.0, Variance: 136367294759.62889, Describe: count 307511.000 mean 538396.207 std 369279.426 min 40500.000 25% 238500.000 50% 450000.000 75% 679500.000 max 4050000.000 Name: AMT_GOODS_PRICE, dtype: float64 ValueCounts: AMT_GOODS_PRICE 450000.000 8.462 225000.000 8.221 675000.000 8.117 900000.000 5.013 270000.000 3.716 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Unaccompanied Unique: ['Unaccompanied' 'Family' 'Spouse' 'Children' 'Others' 
'Group of people'] ValueCounts: NAME_TYPE_SUITE Unaccompanied 81.239 Family 13.056 Spouse 3.697 Children 1.062 Others 0.857 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Working Unique: ['Working' 'State servant' 'Commercial associate' 'Pensioner' 'Unemployed' 'Student' 'Businessman' 'Maternity leave'] ValueCounts: NAME_INCOME_TYPE Working 51.632 Commercial associate 23.289 Pensioner 18.003 State servant 7.058 Unemployed 0.007 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Secondary Unique: ['Secondary' 'Higher education' 'Incomplete higher' 'Lower secondary' 'Academic degree'] ValueCounts: NAME_EDUCATION_TYPE Secondary 71.019 Higher education 24.345 Incomplete higher 3.342 Lower secondary 1.241 Academic degree 0.053 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Married Unique: ['Single' 'Married' 'Civil marriage' 'Widow' 'Separated' 'Unknown'] ValueCounts: NAME_FAMILY_STATUS Married 63.878 Single 14.778 Civil marriage 9.683 Separated 6.429 Widow 5.232 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: House Unique: ['House' 'Rented apartment' 'With parents' 'Municipal apartment' 'Office apartment' 'Co-op apartment'] ValueCounts: NAME_HOUSING_TYPE House 88.734 With parents 4.826 Municipal apartment 3.637 Rented apartment 1.587 Office apartment 0.851 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.035792 Median : 0.01885, Variance: 0.00019130430983259252, Describe: count 307511.000 mean 0.021 std 0.014 min 0.000 25% 0.010 50% 0.019 75% 0.029 max 0.073 Name: REGION_POPULATION_RELATIVE, dtype: float64 ValueCounts: REGION_POPULATION_RELATIVE 0.036 5.336 0.046 4.371 0.031 3.955 0.025 3.886 0.026 3.773 
Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 38 Median : 43.0, Variance: 142.9122920578924, Describe: count 307511.000 mean 43.436 std 11.955 min 20.000 25% 34.000 50% 43.000 75% 53.000 max 69.000 Name: YEARS_BIRTH, dtype: float64 ValueCounts: YEARS_BIRTH 38 2.885 37 2.861 39 2.852 40 2.804 36 2.801 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1000 Median : 6.0, Variance: 145902.7541267848, Describe: count 307511.000 mean 185.022 std 381.972 min 0.000 25% 2.000 50% 6.000 75% 15.000 max 1000.000 Name: YEARS_EMPLOYED, dtype: float64 ValueCounts: YEARS_EMPLOYED 1000 18.007 1 10.354 2 9.641 0 9.074 3 8.165 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Median : 12.0, Variance: 93.06154614411118, Describe: count 307511.000 mean 13.169 std 9.647 min 0.000 25% 5.000 50% 12.000 75% 20.000 max 67.000 Name: YEARS_REGISTRATION, dtype: float64 ValueCounts: YEARS_REGISTRATION 0 5.482 1 5.065 2 4.877 3 4.225 12 4.214 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 11 Unique: [ 5 0 6 9 1 10 2 8 13 3 4 7 12 11 14 16 15 17 19] Median : 8.0, Variance: 17.094215357427704, Describe: count 307511.000 mean 7.713 std 4.135 min 0.000 25% 4.000 50% 8.000 75% 11.000 max 19.000 Name: YEARS_ID_PUBLISH, dtype: float64 ValueCounts: YEARS_ID_PUBLISH 11 14.396 12 12.508 10 6.934 13 6.686 7 6.256 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 3.251916191640975e-06, Describe: count 307511.000 mean 1.000 std 0.002 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_MOBIL, dtype: float64 ValueCounts: FLAG_MOBIL 1 100.000 0 0.000 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.147671271296676, Describe: count 307511.000 mean 0.820 std 0.384 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_EMP_PHONE, dtype: float64 ValueCounts: FLAG_EMP_PHONE 1 81.989 0 18.011 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.15962120697935117, Describe: count 307511.000 mean 0.199 std 0.400 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_WORK_PHONE, dtype: float64 ValueCounts: FLAG_WORK_PHONE 0 80.063 1 19.937 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.0018631217575536686, Describe: count 307511.000 mean 0.998 std 0.043 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_CONT_MOBILE, dtype: float64 ValueCounts: FLAG_CONT_MOBILE 1 99.813 0 0.187 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [1 0] Median : 0.0, Variance: 0.20206872204588347, Describe: count 307511.000 mean 0.281 std 0.450 min 0.000 25% 0.000 50% 0.000 75% 1.000 max 1.000 Name: FLAG_PHONE, dtype: float64 ValueCounts: FLAG_PHONE 0 71.893 1 28.107 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0535029466256677, Describe: count 307511.000 mean 0.057 std 0.231 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_EMAIL, dtype: float64 ValueCounts: FLAG_EMAIL 0 94.328 1 5.672 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 96391, Mode: Laborers Unique: ['Laborers' 'Core staff' 'Accountants' 'Managers' nan 'Drivers' 'Sales staff' 
'Cleaning staff' 'Cooking staff' 'Private service staff' 'Medicine staff' 'Security staff' 'High skill tech staff' 'Waiters/barmen staff' 'Low-skill Laborers' 'Realty agents' 'Secretaries' 'IT staff' 'HR staff'] ValueCounts: OCCUPATION_TYPE Laborers 26.140 Sales staff 15.206 Core staff 13.059 Managers 10.123 Drivers 8.812 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 2.0 Unique: [ 1. 2. 3. 4. 5. 6. 9. 7. 8. 10. 13. 14. 12. 20. 15. 16. 11.] Median : 2.0, Variance: 0.8293356781151717, Describe: count 307511.000 mean 2.153 std 0.911 min 1.000 25% 2.000 50% 2.000 75% 3.000 max 20.000 Name: CNT_FAM_MEMBERS, dtype: float64 ValueCounts: CNT_FAM_MEMBERS 2.000 51.497 1.000 22.063 3.000 17.105 4.000 8.031 5.000 1.131 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 2 Unique: [2 1 3] Median : 2.0, Variance: 0.2591155142160186, Describe: count 307511.000 mean 2.052 std 0.509 min 1.000 25% 2.000 50% 2.000 75% 2.000 max 3.000 Name: REGION_RATING_CLIENT, dtype: float64 ValueCounts: REGION_RATING_CLIENT 2 73.813 3 15.717 1 10.470 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 2 Unique: [2 1 3] Median : 2.0, Variance: 0.2527445242643602, Describe: count 307511.000 mean 2.032 std 0.503 min 1.000 25% 2.000 50% 2.000 75% 2.000 max 3.000 Name: REGION_RATING_CLIENT_W_CITY, dtype: float64 ValueCounts: REGION_RATING_CLIENT_W_CITY 2 74.626 3 14.263 1 11.111 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: TUESDAY Unique: ['WEDNESDAY' 'MONDAY' 'THURSDAY' 'SUNDAY' 'SATURDAY' 'FRIDAY' 'TUESDAY'] ValueCounts: WEEKDAY_APPR_PROCESS_START TUESDAY 17.528 WEDNESDAY 16.889 MONDAY 16.492 THURSDAY 16.452 FRIDAY 16.369 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 10 Unique: [10 11 9 17 16 14 8 15 7 13 6 12 19 3 18 21 4 5 20 22 1 2 23 0] Median : 12.0, Variance: 10.665660320665818, Describe: count 307511.000 mean 12.063 std 3.266 min 0.000 25% 10.000 50% 12.000 75% 14.000 max 23.000 Name: HOUR_APPR_PROCESS_START, dtype: float64 ValueCounts: HOUR_APPR_PROCESS_START 10 12.267 11 12.107 12 11.132 13 10.068 14 9.002 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.014914876209068216, Describe: count 307511.000 mean 0.015 std 0.122 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_REGION_NOT_LIVE_REGION, dtype: float64 ValueCounts: REG_REGION_NOT_LIVE_REGION 0 98.486 1 1.514 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.04819158950907743, Describe: count 307511.000 mean 0.051 std 0.220 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_REGION_NOT_WORK_REGION, dtype: float64 ValueCounts: REG_REGION_NOT_WORK_REGION 0 94.923 1 5.077 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.039005704439295116, Describe: count 307511.000 mean 0.041 std 0.197 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: LIVE_REGION_NOT_WORK_REGION, dtype: float64 ValueCounts: LIVE_REGION_NOT_WORK_REGION 0 95.934 1 4.066 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.07206205892600395, Describe: count 307511.000 mean 0.078 std 0.268 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_CITY_NOT_LIVE_CITY, dtype: float64 ValueCounts: REG_CITY_NOT_LIVE_CITY 0 92.183 1 7.817 Name: 
proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.17734528517578357, Describe: count 307511.000 mean 0.230 std 0.421 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_CITY_NOT_WORK_CITY, dtype: float64 ValueCounts: REG_CITY_NOT_WORK_CITY 0 76.955 1 23.045 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.14731519424670028, Describe: count 307511.000 mean 0.180 std 0.384 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: LIVE_CITY_NOT_WORK_CITY, dtype: float64 ValueCounts: LIVE_CITY_NOT_WORK_CITY 0 82.045 1 17.955 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Business Entity Type 3 ValueCounts: ORGANIZATION_TYPE Business Entity Type 3 22.110 XNA 18.007 Self-employed 12.491 Other 5.425 Medicine 3.640 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.2858978721410488 Median : 0.5659614260608526, Variance: 0.03643133076657983, Describe: count 307511.000 mean 0.515 std 0.191 min 0.000 25% 0.393 50% 0.566 75% 0.663 max 0.855 Name: EXT_SOURCE_2, dtype: float64 ValueCounts: EXT_SOURCE_2 0.286 0.234 0.566 0.215 0.262 0.136 0.265 0.112 0.160 0.105 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.5352762504724826 Median : 0.5352762504724826, Variance: 0.030532570833502897, Describe: count 307511.000 mean 0.516 std 0.175 min 0.001 25% 0.417 50% 0.535 75% 0.636 max 0.896 Name: EXT_SOURCE_3, dtype: float64 ValueCounts: EXT_SOURCE_3 0.535 20.081 0.746 0.475 0.714 0.428 0.694 0.415 0.671 0.387 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 
0.10254666268544127 Median : 0.10254666268544127, Variance: 0.0059740116766552115, Describe: count 307511.000 mean 0.103 std 0.077 min 0.000 25% 0.067 50% 0.103 75% 0.103 max 1.000 Name: TOTALAREA_MODE, dtype: float64 ValueCounts: TOTALAREA_MODE 0.103 48.269 0.000 0.189 0.057 0.080 0.055 0.075 0.055 0.074 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: No Unique: ['No' 'Yes'] ValueCounts: EMERGENCYSTATE_MODE No 99.243 Yes 0.757 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 2. 1. 0. 4. 8. 10. 7. 3. 6. 5. 12. 9. 13. 11. 14. 22. 16. 15. 17. 20. 25. 19. 18. 21. 24. 23. 28. 26. 29. 27. 47. 348. 30.] Median : 0.0, Variance: 5.75230052383172, Describe: count 307511.000 mean 1.418 std 2.398 min 0.000 25% 0.000 50% 0.000 75% 2.000 max 348.000 Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: OBS_30_CNT_SOCIAL_CIRCLE 0.000 53.634 1.000 15.864 2.000 9.693 3.000 6.609 4.000 4.599 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 2. 0. 1. 3. 4. 5. 6. 7. 34. 8.] Median : 0.0, Variance: 0.19894504078331582, Describe: count 307511.000 mean 0.143 std 0.446 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 34.000 Name: DEF_30_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: DEF_30_CNT_SOCIAL_CIRCLE 0.000 88.564 1.000 9.212 2.000 1.731 3.000 0.388 4.000 0.082 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 2. 1. 0. 4. 8. 10. 7. 3. 6. 5. 12. 9. 13. 11. 14. 21. 15. 22. 16. 20. 25. 17. 19. 18. 24. 23. 28. 29. 27. 47. 344. 30. 26.] 
Median : 0.0, Variance: 5.651195211134213, Describe: count 307511.000 mean 1.401 std 2.377 min 0.000 25% 0.000 50% 0.000 75% 2.000 max 344.000 Name: OBS_60_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: OBS_60_CNT_SOCIAL_CIRCLE 0.000 53.880 1.000 15.892 2.000 9.680 3.000 6.574 4.000 4.535 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 2. 0. 1. 3. 5. 4. 7. 24. 6.] Median : 0.0, Variance: 0.13085195710647549, Describe: count 307511.000 mean 0.100 std 0.362 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 24.000 Name: DEF_60_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: DEF_60_CNT_SOCIAL_CIRCLE 0.000 91.620 1.000 7.103 2.000 1.031 3.000 0.194 4.000 0.044 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [ 3 2 1 6 4 0 7 5 8 9 10 11] Median : 2.0, Variance: 4.812206075852008, Describe: count 307511.000 mean 2.225 std 2.194 min 0.000 25% 0.000 50% 2.000 75% 4.000 max 11.000 Name: YEARS_LAST_PHONE_CHANGE, dtype: float64 ValueCounts: YEARS_LAST_PHONE_CHANGE 0 30.064 1 18.744 4 12.377 2 12.338 3 9.733 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 4.227326079183737e-05, Describe: count 307511.000 mean 0.000 std 0.007 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_2, dtype: float64 ValueCounts: FLAG_DOCUMENT_2 0 99.996 1 0.004 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.20589084885614972, Describe: count 307511.000 mean 0.710 std 0.454 min 0.000 25% 0.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_DOCUMENT_3, dtype: float64 ValueCounts: FLAG_DOCUMENT_3 1 71.002 0 28.998 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 8.129155979353892e-05, Describe: count 307511.000 mean 0.000 std 0.009 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_4, dtype: float64 ValueCounts: FLAG_DOCUMENT_4 0 99.992 1 0.008 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.014886494471071949, Describe: count 307511.000 mean 0.015 std 0.122 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_5, dtype: float64 ValueCounts: FLAG_DOCUMENT_5 0 98.489 1 1.511 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.08030189665614886, Describe: count 307511.000 mean 0.088 std 0.283 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_6, dtype: float64 ValueCounts: FLAG_DOCUMENT_6 0 91.194 1 8.806 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00019182686767647077, Describe: count 307511.000 mean 0.000 std 0.014 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_7, dtype: float64 ValueCounts: FLAG_DOCUMENT_7 0 99.981 1 0.019 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.07475414850818889, Describe: count 307511.000 mean 0.081 std 0.273 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_8, dtype: float64 ValueCounts: FLAG_DOCUMENT_8 0 91.862 1 8.138 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0038806309936899315, Describe: count 
307511.000 mean 0.004 std 0.062 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_9, dtype: float64 ValueCounts: FLAG_DOCUMENT_9 0 99.610 1 0.390 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 2.276296919164723e-05, Describe: count 307511.000 mean 0.000 std 0.005 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_10, dtype: float64 ValueCounts: FLAG_DOCUMENT_10 0 99.998 1 0.002 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0038967636747465713, Describe: count 307511.000 mean 0.004 std 0.062 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_11, dtype: float64 ValueCounts: FLAG_DOCUMENT_11 0 99.609 1 0.391 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 6.5038112332850705e-06, Describe: count 307511.000 mean 0.000 std 0.003 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_12, dtype: float64 ValueCounts: FLAG_DOCUMENT_12 0 99.999 1 0.001 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.003512662405688219, Describe: count 307511.000 mean 0.004 std 0.059 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_13, dtype: float64 ValueCounts: FLAG_DOCUMENT_13 0 99.647 1 0.353 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0029278669255157, Describe: count 307511.000 mean 0.003 std 0.054 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_14, dtype: float64 ValueCounts: FLAG_DOCUMENT_14 0 99.706 
1 0.294 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0012082533473030956, Describe: count 307511.000 mean 0.001 std 0.035 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_15, dtype: float64 ValueCounts: FLAG_DOCUMENT_15 0 99.879 1 0.121 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.009829564925667255, Describe: count 307511.000 mean 0.010 std 0.099 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_16, dtype: float64 ValueCounts: FLAG_DOCUMENT_16 0 99.007 1 0.993 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00026658688860710627, Describe: count 307511.000 mean 0.000 std 0.016 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_17, dtype: float64 ValueCounts: FLAG_DOCUMENT_17 0 99.973 1 0.027 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00806372320835816, Describe: count 307511.000 mean 0.008 std 0.090 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_18, dtype: float64 ValueCounts: FLAG_DOCUMENT_18 0 99.187 1 0.813 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0005947484523376042, Describe: count 307511.000 mean 0.001 std 0.024 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_19, dtype: float64 ValueCounts: FLAG_DOCUMENT_19 0 99.940 1 0.060 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 
0.0, Variance: 0.0005070432225544667, Describe: count 307511.000 mean 0.001 std 0.023 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_20, dtype: float64 ValueCounts: FLAG_DOCUMENT_20 0 99.949 1 0.051 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00033483626685604196, Describe: count 307511.000 mean 0.000 std 0.018 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_21, dtype: float64 ValueCounts: FLAG_DOCUMENT_21 0 99.967 1 0.033 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [0. 1. 2. 3. 4.] Median : 0.0, Variance: 0.006086204557259623, Describe: count 307511.000 mean 0.006 std 0.078 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 4.000 Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_HOUR 0.000 99.471 1.000 0.507 2.000 0.018 3.000 0.003 4.000 0.000 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [0. 1. 3. 2. 4. 5. 6. 9. 8.] Median : 0.0, Variance: 0.010616648120285148, Describe: count 307511.000 mean 0.006 std 0.103 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 9.000 Name: AMT_REQ_CREDIT_BUREAU_DAY, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_DAY 0.000 99.516 1.000 0.420 2.000 0.034 3.000 0.015 4.000 0.008 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [0. 1. 3. 2. 4. 5. 6. 8. 7.] 
Median : 0.0, Variance: 0.03637714618090649, Describe: count 307511.000 mean 0.030 std 0.191 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 8.000 Name: AMT_REQ_CREDIT_BUREAU_WEEK, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_WEEK 0.000 97.224 1.000 2.669 2.000 0.065 3.000 0.019 4.000 0.011 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 0. 1. 2. 6. 5. 3. 7. 9. 4. 11. 8. 16. 12. 14. 10. 13. 17. 24. 19. 15. 23. 18. 27. 22.] Median : 0.0, Variance: 0.7341235021786728, Describe: count 307511.000 mean 0.231 std 0.857 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 27.000 Name: AMT_REQ_CREDIT_BUREAU_MON, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_MON 0.000 85.770 1.000 10.779 2.000 1.751 3.000 0.647 4.000 0.350 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 0. 1. 2. 4. 3. 8. 5. 6. 7. 261. 19.] Median : 0.0, Variance: 0.5536237845454758, Describe: count 307511.000 mean 0.230 std 0.744 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 261.000 Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_QRT 0.000 83.553 1.000 11.012 2.000 4.687 3.000 0.558 4.000 0.155 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 1. 0. 2. 4. 5. 3. 8. 6. 9. 7. 10. 11. 13. 16. 12. 25. 23. 15. 14. 22. 17. 19. 18. 21. 20.] 
Median : 1.0, Variance: 3.444070410791478, Describe: count 307511.000 mean 1.643 std 1.856 min 0.000 25% 0.000 50% 1.000 75% 3.000 max 25.000 Name: AMT_REQ_CREDIT_BUREAU_YEAR, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_YEAR 0.000 36.851 1.000 20.619 2.000 16.322 3.000 10.936 4.000 6.736 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.1667 Median : 0.1667, Variance: 0.01126328020998967, Describe: count 307511.000 mean 0.196 std 0.106 min 0.000 25% 0.167 50% 0.167 75% 0.167 max 1.000 Name: FLOORSMAX, dtype: float64 ValueCounts: FLOORSMAX 0.167 69.852 0.333 10.352 0.042 4.735 0.375 2.531 0.125 2.188 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.9775174983915752 Median : 0.9775174983915752, Variance: 0.0018895990628744477, Describe: count 307511.000 mean 0.978 std 0.043 min 0.000 25% 0.978 50% 0.978 75% 0.982 max 1.000 Name: YEARS_BEGINEXPLOITATION, dtype: float64 ValueCounts: YEARS_BEGINEXPLOITATION 0.978 48.781 0.987 1.316 0.986 1.264 0.987 1.255 0.980 1.234 Name: proportion, dtype: float64 ------------------------------------------------------------------
- Apart from OCCUPATION_TYPE, every column with missing values has now been either dropped or imputed.
- OCCUPATION_TYPE still has about 31% missing values. On analysis, we decided not to impute it with the mode alone, because no single category dominates: the most frequent one, Laborers, accounts for only about 26% of the non-null values.
- The column cannot be dropped either, since it is a critical feature. We therefore first need to determine whether the values are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).
- Until that is settled at a later stage, we keep the column as it is.
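One way to begin the MCAR/MAR/MNAR check mentioned above is to compare the share of missing OCCUPATION_TYPE values across the categories of another column. The snippet below is a minimal sketch of that idea, not the analysis used in this case study: `missing_rate_by_group` is a hypothetical helper and the toy rows are invented purely for illustration. Roughly equal rates across groups would be consistent with MCAR, while rates that differ sharply between groups suggest the missingness depends on an observed variable (MAR). MNAR cannot be confirmed from the observed data alone.

```python
import pandas as pd


def missing_rate_by_group(df, column, group_col):
    """Return the percentage of null values in `column` for each category of `group_col`."""
    return (
        df[column].isna()            # True where the value is missing
        .groupby(df[group_col])      # group the boolean flags by the other column
        .mean()                      # share of missing values per group
        .mul(100)                    # express as a percentage
        .sort_values(ascending=False)
    )


# Toy data, invented for illustration only (not taken from the loan dataset).
toy = pd.DataFrame({
    "NAME_INCOME_TYPE": ["Pensioner", "Pensioner", "Working",
                         "Working", "Working", "State servant"],
    "OCCUPATION_TYPE": [None, None, "Laborers", None, "Drivers", "Core staff"],
})

rates = missing_rate_by_group(toy, "OCCUPATION_TYPE", "NAME_INCOME_TYPE")
print(rates)
```

On the real data the same call would be `missing_rate_by_group(curr_appl_data1, "OCCUPATION_TYPE", "NAME_INCOME_TYPE")`, and it can be repeated for other columns such as ORGANIZATION_TYPE.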
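# Bin credit amount, years employed and age into labelled ranges for segment-wise analysis.
# pd.cut intervals are right-inclusive by default, e.g. (0, 200000] maps to "0-200K".
curr_appl_data1["AMT_CREDIT_BINS"] = pd.cut(curr_appl_data1['AMT_CREDIT'], bins=[0,200000,400000,600000,800000,1000000,10000000], labels=["0-200K","200-400k","400-600k","600-800k","800-1M","1M+"])
# The lower edge of -100 keeps YEARS_EMPLOYED == 0 inside the first bin.
# Note: the value 1000 (about 18% of rows, apparently a placeholder) falls into "60+".
curr_appl_data1["YEARS_EMPLOYED_BINS"] = pd.cut(curr_appl_data1['YEARS_EMPLOYED'], bins=[-100,10,20,30,40,50,60,1000], labels=["0-10","10-20","20-30","30-40","40-50","50-60","60+"])
curr_appl_data1['AGE_Category'] = pd.cut(curr_appl_data1['YEARS_BIRTH'], bins=[0,30,40,50,60,200], labels=["<30","30-40","40-50","50-60","60+"])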
curr_appl_data1
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | YEARS_BIRTH | YEARS_EMPLOYED | YEARS_REGISTRATION | YEARS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | TOTALAREA_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | YEARS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | FLOORSMAX | YEARS_BEGINEXPLOITATION | AMT_CREDIT_BINS | YEARS_EMPLOYED_BINS | AGE_Category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | No | Yes | 0 | 202500.000 | 406597.500 | 24700.500 | 351000.000 | Unaccompanied | Working | Secondary | Single | House | 0.019 | 25 | 1 | 9 | 5 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.000 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.263 | 0.139 | 0.015 | No | 2.000 | 2.000 | 2.000 | 2.000 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.083 | 0.972 | 400-600k | 0-10 | <30 |
1 | 100003 | 0 | Cash loans | F | No | No | 0 | 270000.000 | 1293502.500 | 35698.500 | 1129500.000 | Family | State servant | Higher education | Married | House | 0.004 | 45 | 3 | 3 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.000 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622 | 0.535 | 0.071 | No | 1.000 | 0.000 | 1.000 | 0.000 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.292 | 0.985 | 1M+ | 0-10 | 40-50 |
2 | 100004 | 0 | Revolving loans | M | Yes | Yes | 0 | 67500.000 | 135000.000 | 6750.000 | 135000.000 | Unaccompanied | Working | Secondary | Single | House | 0.010 | 52 | 0 | 11 | 6 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.000 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.556 | 0.730 | 0.103 | No | 0.000 | 0.000 | 0.000 | 0.000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.167 | 0.978 | 0-200K | 0-10 | 50-60 |
3 | 100006 | 0 | Cash loans | F | No | Yes | 0 | 135000.000 | 312682.500 | 29686.500 | 297000.000 | Unaccompanied | Working | Secondary | Civil marriage | House | 0.008 | 52 | 8 | 26 | 6 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.000 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.650 | 0.535 | 0.103 | No | 2.000 | 0.000 | 2.000 | 0.000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.167 | 0.978 | 200-400k | 0-10 | 50-60 |
4 | 100007 | 0 | Cash loans | M | No | Yes | 0 | 121500.000 | 513000.000 | 21865.500 | 513000.000 | Unaccompanied | Working | Secondary | Single | House | 0.029 | 54 | 8 | 11 | 9 | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.000 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.323 | 0.535 | 0.103 | No | 0.000 | 0.000 | 0.000 | 0.000 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.167 | 0.978 | 400-600k | 0-10 | 50-60 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
307506 | 456251 | 0 | Cash loans | M | No | No | 0 | 157500.000 | 254700.000 | 27558.000 | 225000.000 | Unaccompanied | Working | Secondary | Separated | With parents | 0.033 | 25 | 0 | 23 | 5 | 1 | 1 | 0 | 1 | 0 | 0 | Sales staff | 1.000 | 1 | 1 | THURSDAY | 15 | 0 | 0 | 0 | 0 | 0 | 0 | Services | 0.682 | 0.535 | 0.290 | No | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.556 | 0.984 | 200-400k | 0-10 | <30 |
307507 | 456252 | 0 | Cash loans | F | No | Yes | 0 | 72000.000 | 269550.000 | 12001.500 | 225000.000 | Unaccompanied | Pensioner | Secondary | Widow | House | 0.025 | 56 | 1000 | 12 | 11 | 1 | 0 | 0 | 1 | 1 | 0 | NaN | 1.000 | 2 | 2 | MONDAY | 8 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.116 | 0.535 | 0.021 | No | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.083 | 0.973 | 200-400k | 60+ | 50-60 |
307508 | 456253 | 0 | Cash loans | F | No | Yes | 0 | 153000.000 | 677664.000 | 29979.000 | 585000.000 | Unaccompanied | Working | Higher education | Separated | House | 0.005 | 41 | 21 | 18 | 14 | 1 | 1 | 0 | 1 | 0 | 1 | Managers | 1.000 | 3 | 3 | THURSDAY | 9 | 0 | 0 | 0 | 0 | 1 | 1 | School | 0.536 | 0.219 | 0.797 | No | 6.000 | 0.000 | 6.000 | 0.000 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.167 | 0.982 | 600-800k | 20-30 | 40-50 |
307509 | 456254 | 1 | Cash loans | F | No | Yes | 0 | 171000.000 | 370107.000 | 20205.000 | 319500.000 | Unaccompanied | Commercial associate | Secondary | Married | House | 0.005 | 32 | 13 | 7 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.000 | 2 | 2 | WEDNESDAY | 9 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 1 | 0.514 | 0.661 | 0.009 | No | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.042 | 0.977 | 200-400k | 10-20 | 30-40 |
307510 | 456255 | 0 | Cash loans | F | No | No | 0 | 157500.000 | 675000.000 | 49117.500 | 675000.000 | Unaccompanied | Commercial associate | Higher education | Married | House | 0.046 | 46 | 3 | 14 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 2.000 | 1 | 1 | THURSDAY | 20 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.709 | 0.114 | 0.072 | No | 0.000 | 0.000 | 0.000 | 0.000 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 2.000 | 0.000 | 1.000 | 0.375 | 0.988 | 600-800k | 0-10 | 40-50 |
307511 rows × 80 columns
curr_appl_data1[curr_appl_data1['OCCUPATION_TYPE'].isna()]['NAME_INCOME_TYPE'].value_counts(normalize=True)
NAME_INCOME_TYPE Pensioner 0.574 Working 0.259 Commercial associate 0.128 State servant 0.039 Unemployed 0.000 Student 0.000 Businessman 0.000 Maternity leave 0.000 Name: proportion, dtype: float64
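About 57% of the rows with a missing OCCUPATION_TYPE belong to the Pensioner income type, so the missing values in OCCUPATION_TYPE are replaced with 'Pensioner' as a category. Note that this also labels the remaining ~43% (mostly Working and Commercial associate) as 'Pensioner'.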
curr_appl_data1["OCCUPATION_TYPE"] = curr_appl_data1["OCCUPATION_TYPE"].fillna('Pensioner')
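Because a sizeable share of the rows with a missing occupation are not pensioners according to the proportions above, a more conservative alternative is to fill only the pensioner rows and label the rest separately. The sketch below runs on a toy frame; the `'Unknown'` label and the toy values are assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for curr_appl_data1 (values are illustrative only).
df = pd.DataFrame({
    "NAME_INCOME_TYPE": ["Pensioner", "Working", "Pensioner", "State servant"],
    "OCCUPATION_TYPE": [None, None, "Laborers", "Core staff"],
})

missing = df["OCCUPATION_TYPE"].isna()
is_pensioner = df["NAME_INCOME_TYPE"] == "Pensioner"

# Pensioners with no occupation are labelled 'Pensioner';
# any other missing occupation gets 'Unknown' (an assumed label).
df.loc[missing & is_pensioner, "OCCUPATION_TYPE"] = "Pensioner"
df.loc[missing & ~is_pensioner, "OCCUPATION_TYPE"] = "Unknown"

print(df["OCCUPATION_TYPE"].tolist())
```

This keeps the pensioner signal while avoiding mislabelling working applicants.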
Univariate and Outlier analysis
curr_target_train_0 = curr_appl_data1.loc[curr_appl_data1['TARGET'] == 0]
curr_target_train_1 = curr_appl_data1.loc[curr_appl_data1['TARGET'] == 1]
- The requirement is to analyze the features against the target variable, so we segmented the data into two parts:
- Train 1 - All clients with payment difficulties (TARGET == 1)
- Train 0 - All other cases, i.e. clients without payment difficulties (TARGET == 0)
- As part of the univariate analysis, we generate plots and charts for all columns depending on their data type. The inferences are explained below.
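The segmentation above can also be summarised numerically before plotting. The following is a minimal sketch, assuming a hypothetical helper `share_by_target` (not part of the notebook) and a toy frame, that compares the percentage distribution of a categorical column within each target class.

```python
import pandas as pd

def share_by_target(df, column, target="TARGET"):
    """Percentage distribution of `column` within each target class."""
    return (
        pd.crosstab(df[column], df[target], normalize="columns")
          .mul(100)
          .round(1)
    )

# Toy data: four non-defaulters (TARGET 0) and two defaulters (TARGET 1).
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 0, 1, 1],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", "Cash loans",
                           "Revolving loans", "Cash loans", "Cash loans"],
})

shares = share_by_target(toy, "NAME_CONTRACT_TYPE")
print(shares)
```

Comparing the two columns of the result side by side shows whether a category is over-represented among clients with payment difficulties.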
curr_desc_cols = curr_appl_data1.describe().columns
dtype_dict = classify_feature_dtype(curr_appl_data1,curr_desc_cols)
curr_dtype_dict = dtype_dict['float_ts']
# univariate_plots(curr_appl_data1.head(60), curr_appl_data1.select_dtypes(object).columns, t0=curr_target_train_0, t1=curr_target_train_1, ftype="category")
# print("--------------------------------------------------------------------------")
# univariate_plots(curr_appl_data1.head(60), dtype_dict['int_cat'], t0=curr_target_train_0, t1=curr_target_train_1, ftype="category")
# print("--------------------------------------------------------------------------")
# univariate_plots(curr_appl_data1.head(60), dtype_dict['float_ts'], ftype="non_categorical")
univariate_plots(curr_appl_data1, curr_appl_data1.select_dtypes(object).columns, t0 = curr_target_train_0, t1 = curr_target_train_1, ftype = "category")
print("--------------------------------------------------------------------------")
univariate_plots(curr_appl_data1, dtype_dict['int_cat'], t0 = curr_target_train_0, t1 = curr_target_train_1, ftype = "category")
print("--------------------------------------------------------------------------")
univariate_plots(curr_appl_data1, dtype_dict['float_ts'], ftype = "non_categorical")
- Univariate and Outlier analysis:
- Important Observations
Name_Contract_Type:
- In the given dataset, the majority of the loans are cash loans.
- This suggests that clients find cash loans attractive.
Code Gender:
- The majority of the clients are female.
- Females show more interest in applying for loans than males.
- They also appear to be more responsible about timely repayment of the instalments.
Flag Own Car:
- The majority of the applicants do not own a car.
Flag Own Realty:
- The majority of the applicants own a property.
Name_Income_Type:
- The majority of the applicants are working citizens.
Name_education_type:
- The majority of the applicants requesting a loan have completed secondary education.
Name_Family_Status:
- Most applicants requesting a loan are married.
Housing Type:
- The majority of the applicants own a house.
Occupation Type:
- Most applicants are laborers.
- The possibility of their loans being rejected is very high.
Target Feature
- Majority of the applicants do not have any payment difficulties
Flag_Mobile:
- Almost everyone owns a mobile.
Flag_columns:
- Most of the flag columns are dominated by a single category, so not many insights can be derived from them.
CNT_children:
- There are outliers in the cnt_children feature.
- Many applicants have either one or two children.
AMT_Income:
- There are outliers in the income feature.
AMT_Credit:
- There are outliers in the AMT_credit feature.
- Binning or capping needs to be done.
AMT_Annuity:
- The majority of the applicants have a similar annuity.
- The distribution is skewed.
Years_Birth:
- The majority of the population is between 30 and 40 years of age.
Years employed:
- There are outliers.
Years Registration:
- The majority of the applicants have recently changed their registration.
CNT_Family Members:
- There are outliers.
- Most applicants have between 1 and 5 family members.
Hour_Appr_Process_Start:
- Most clients applied for the loan between 10 and 13 hrs, which is predictable.
Ext source 2:
- Most clients have around 0.6 rating
Ext source 3:
- Most clients have around 0.5 rating
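Several observations above mention outliers and the need for capping. As a minimal sketch (the 1.5 × IQR rule and the toy income values are assumptions, not the method used in the notebook), outliers can be flagged and capped like this:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy incomes with one extreme value (illustrative only).
income = pd.Series(
    [90_000.0, 112_500.0, 135_000.0, 157_500.0, 202_500.0, 117_000_000.0],
    name="AMT_INCOME_TOTAL",
)

low, high = iqr_bounds(income)
outliers = income[(income < low) | (income > high)]

# Capping: pull values outside the fences back to the fence.
capped = income.clip(lower=low, upper=high)

print(outliers.tolist())
print(capped.max())
```

Capping keeps every row while limiting the influence of extreme values on plots and summary statistics; binning (as in AMT_CREDIT_BINS) is the alternative when only the range matters.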
Data Cleaning - Previous Application Dataset
prev_appl_na_pct = check_cols_null_pct(prev_appl_data)
prev_appl_data1 = prev_appl_data.loc[:, (prev_appl_na_pct < 50)]
prev_appl_data1.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_GOODS_PRICE', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY', 'NAME_CASH_LOAN_PURPOSE', 'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY', 'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'], dtype='object')
- Columns in which more than 50% of the values are null are generally not worth imputing, since most of the column would then be made-up values.
- Therefore we keep only the columns with less than 50% nulls.
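The helper `check_cols_null_pct` is defined elsewhere in the notebook. A hypothetical stand-in, assuming it returns the percentage of nulls per column sorted from highest to lowest, could look like the sketch below; `null_pct`, the toy frame, and the column `MOSTLY_EMPTY_COL` are illustrative names only.

```python
import numpy as np
import pandas as pd

def null_pct(df):
    """Percentage of missing values per column, highest first."""
    return df.isna().mean().mul(100).sort_values(ascending=False)

toy = pd.DataFrame({
    "SK_ID_PREV": [1, 2, 3, 4],
    "AMT_ANNUITY": [100.0, np.nan, 300.0, 400.0],            # 25% missing
    "MOSTLY_EMPTY_COL": [np.nan, np.nan, np.nan, 0.2],       # 75% missing
})

pct = null_pct(toy)

# Keep columns below the 50% threshold, preserving the original column order.
kept = toy[[col for col in toy.columns if pct[col] < 50]]

print(pct.round(1).to_dict())
print(kept.columns.tolist())
```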
show_stats(prev_appl_data1, prev_appl_data1.columns)
prev_appl_na_pct1 = check_cols_null_pct(prev_appl_data1)
prev_appl_na_pct1
Total Nulls: 0, Mode: 1000001 Median : 1923110.5, Variance: 283660585606.6597, Describe: count 1670214.000 mean 1923089.135 std 532597.959 min 1000001.000 25% 1461857.250 50% 1923110.500 75% 2384279.750 max 2845382.000 Name: SK_ID_PREV, dtype: float64 ValueCounts: SK_ID_PREV 2030495 0.000 1035848 0.000 1526498 0.000 2148893 0.000 2437429 0.000 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 187868 Median : 278714.5, Variance: 10570888003.009909, Describe: count 1670214.000 mean 278357.174 std 102814.824 min 100001.000 25% 189329.000 50% 278714.500 75% 367514.000 max 456255.000 Name: SK_ID_CURR, dtype: float64 ValueCounts: SK_ID_CURR 187868 0.005 265681 0.004 173680 0.004 242412 0.004 206783 0.004 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash loans Unique: ['Consumer loans' 'Cash loans' 'Revolving loans' 'XNA'] ValueCounts: NAME_CONTRACT_TYPE Cash loans 44.758 Consumer loans 43.656 Revolving loans 11.565 XNA 0.021 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 372235, Mode: 2250.0 Median : 11250.0, Variance: 218511584.1819954, Describe: count 1297979.000 mean 15955.121 std 14782.137 min 0.000 25% 6321.780 50% 11250.000 75% 20658.420 max 418058.145 Name: AMT_ANNUITY, dtype: float64 ValueCounts: AMT_ANNUITY 2250.000 2.455 11250.000 1.077 6750.000 1.036 9000.000 0.963 22500.000 0.917 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Median : 71046.0, Variance: 85719989263.5158, Describe: count 1670214.000 mean 175233.860 std 292779.762 min 0.000 25% 18720.000 50% 71046.000 75% 180360.000 max 6905160.000 Name: AMT_APPLICATION, dtype: float64 ValueCounts: AMT_APPLICATION 0.000 23.494 45000.000 2.864 225000.000 2.607 135000.000 2.435 450000.000 2.329 Name: proportion, dtype: 
float64 ------------------------------------------------------------------ Total Nulls: 1, Mode: 0.0 Median : 80541.0, Variance: 101489786307.34668, Describe: count 1670213.000 mean 196114.021 std 318574.617 min 0.000 25% 24160.500 50% 80541.000 75% 216418.500 max 6905160.000 Name: AMT_CREDIT, dtype: float64 ValueCounts: AMT_CREDIT 0.000 20.163 45000.000 2.099 225000.000 1.263 450000.000 1.195 135000.000 1.121 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 385515, Mode: 45000.0 Median : 112320.0, Variance: 99474988758.27444, Describe: count 1284699.000 mean 227847.279 std 315396.558 min 0.000 25% 50841.000 50% 112320.000 75% 234000.000 max 6905160.000 Name: AMT_GOODS_PRICE, dtype: float64 ValueCounts: AMT_GOODS_PRICE 45000.000 3.723 225000.000 3.390 135000.000 3.165 450000.000 3.030 90000.000 2.286 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: TUESDAY Unique: ['SATURDAY' 'THURSDAY' 'TUESDAY' 'MONDAY' 'FRIDAY' 'SUNDAY' 'WEDNESDAY'] ValueCounts: WEEKDAY_APPR_PROCESS_START TUESDAY 15.275 WEDNESDAY 15.268 MONDAY 15.181 FRIDAY 15.091 THURSDAY 14.914 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 11 Unique: [15 11 7 9 8 10 12 13 14 16 6 4 5 19 17 18 20 22 21 3 1 2 23 0] Median : 12.0, Variance: 11.11574247090721, Describe: count 1670214.000 mean 12.484 std 3.334 min 0.000 25% 10.000 50% 12.000 75% 15.000 max 23.000 Name: HOUR_APPR_PROCESS_START, dtype: float64 ValueCounts: HOUR_APPR_PROCESS_START 11 11.539 12 11.135 10 10.878 13 10.313 14 9.443 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Y Unique: ['Y' 'N'] ValueCounts: FLAG_LAST_APPL_PER_CONTRACT Y 99.493 N 0.507 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, 
Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.003520005148666215, Describe: count 1670214.000 mean 0.996 std 0.059 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: NFLAG_LAST_APPL_IN_DAY, dtype: float64 ValueCounts: NFLAG_LAST_APPL_IN_DAY 1 99.647 0 0.353 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XAP Unique: ['XAP' 'XNA' 'Repairs' 'Everyday expenses' 'Car repairs' 'Building a house or an annex' 'Other' 'Journey' 'Purchase of electronic equipment' 'Medicine' 'Payments on other loans' 'Urgent needs' 'Buying a used car' 'Buying a new car' 'Buying a holiday home / land' 'Education' 'Buying a home' 'Furniture' 'Buying a garage' 'Business development' 'Wedding / gift / holiday' 'Hobby' 'Gasification / water supply' 'Refusal to name the goal' 'Money for a third person'] ValueCounts: NAME_CASH_LOAN_PURPOSE XAP 55.242 XNA 40.589 Repairs 1.423 Other 0.934 Urgent needs 0.504 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Approved Unique: ['Approved' 'Refused' 'Canceled' 'Unused offer'] ValueCounts: NAME_CONTRACT_STATUS Approved 62.075 Canceled 18.939 Refused 17.404 Unused offer 1.583 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: -245 Median : -581.0, Variance: 606996.2907461969, Describe: count 1670214.000 mean -880.680 std 779.100 min -2922.000 25% -1300.000 50% -581.000 75% -280.000 max -1.000 Name: DAYS_DECISION, dtype: float64 ValueCounts: DAYS_DECISION -245 0.146 -238 0.143 -210 0.142 -273 0.141 -196 0.139 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash through the bank Unique: ['Cash through the bank' 'XNA' 'Non-cash from your account' 'Cashless from the account of the employer'] ValueCounts: NAME_PAYMENT_TYPE Cash through the bank 61.881 XNA 37.563 
Non-cash from your account 0.491 Cashless from the account of the employer 0.065 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XAP Unique: ['XAP' 'HC' 'LIMIT' 'CLIENT' 'SCOFR' 'SCO' 'XNA' 'VERIF' 'SYSTEM'] ValueCounts: CODE_REJECT_REASON XAP 81.013 HC 10.492 LIMIT 3.334 SCO 2.243 CLIENT 1.583 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 820405, Mode: Unaccompanied Unique: [nan 'Unaccompanied' 'Spouse, partner' 'Family' 'Children' 'Other_B' 'Other_A' 'Group of people'] ValueCounts: NAME_TYPE_SUITE Unaccompanied 59.892 Family 25.095 Spouse, partner 7.892 Children 3.714 Other_B 2.074 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Repeater Unique: ['Repeater' 'New' 'Refreshed' 'XNA'] ValueCounts: NAME_CLIENT_TYPE Repeater 73.719 New 18.043 Refreshed 8.122 XNA 0.116 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['Mobile' 'XNA' 'Consumer Electronics' 'Construction Materials' 'Auto Accessories' 'Photo / Cinema Equipment' 'Computers' 'Audio/Video' 'Medicine' 'Clothing and Accessories' 'Furniture' 'Sport and Leisure' 'Homewares' 'Gardening' 'Jewelry' 'Vehicles' 'Education' 'Medical Supplies' 'Other' 'Direct Sales' 'Office Appliances' 'Fitness' 'Tourism' 'Insurance' 'Additional Service' 'Weapon' 'Animals' 'House Construction'] ValueCounts: NAME_GOODS_CATEGORY XNA 56.927 Mobile 13.454 Consumer Electronics 7.279 Computers 6.333 Audio/Video 5.954 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: POS Unique: ['POS' 'Cash' 'XNA' 'Cards' 'Cars'] ValueCounts: NAME_PORTFOLIO POS 41.373 Cash 27.635 XNA 22.286 Cards 8.681 Cars 0.025 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['XNA' 'x-sell' 'walk-in'] ValueCounts: NAME_PRODUCT_TYPE XNA 63.684 x-sell 27.319 walk-in 8.997 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Credit and cash offices Unique: ['Country-wide' 'Contact center' 'Credit and cash offices' 'Stone' 'Regional / Local' 'AP+ (Cash loan)' 'Channel of corporate sales' 'Car dealer'] ValueCounts: CHANNEL_TYPE Credit and cash offices 43.106 Country-wide 29.618 Stone 12.698 Regional / Local 6.498 Contact center 4.269 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: -1 Median : 3.0, Variance: 50800450.2636309, Describe: count 1670214.000 mean 313.951 std 7127.443 min -1.000 25% -1.000 50% 3.000 75% 82.000 max 4000000.000 Name: SELLERPLACE_AREA, dtype: float64 ValueCounts: SELLERPLACE_AREA -1 45.663 0 3.624 50 2.239 30 2.061 20 2.026 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['Connectivity' 'XNA' 'Consumer electronics' 'Industry' 'Clothing' 'Furniture' 'Construction' 'Jewelry' 'Auto technology' 'MLM partners' 'Tourism'] ValueCounts: NAME_SELLER_INDUSTRY XNA 51.234 Consumer electronics 23.845 Connectivity 16.527 Furniture 3.464 Construction 1.783 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 372230, Mode: 12.0 Median : 12.0, Variance: 212.2058722166199, Describe: count 1297984.000 mean 16.054 std 14.567 min 0.000 25% 6.000 50% 12.000 75% 24.000 max 84.000 Name: CNT_PAYMENT, dtype: float64 ValueCounts: CNT_PAYMENT 12.000 24.889 6.000 14.674 0.000 11.170 10.000 10.929 24.000 10.614 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['middle' 
'low_action' 'high' 'low_normal' 'XNA'] ValueCounts: NAME_YIELD_GROUP XNA 30.967 middle 23.083 high 21.155 low_normal 19.285 low_action 5.511 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 346, Mode: Cash Unique: ['POS mobile with interest' 'Cash X-Sell: low' 'Cash X-Sell: high' 'Cash X-Sell: middle' 'Cash Street: high' 'Cash' 'POS household without interest' 'POS household with interest' 'POS other with interest' 'Card X-Sell' 'POS mobile without interest' 'Card Street' 'POS industry with interest' 'Cash Street: low' 'POS industry without interest' 'Cash Street: middle' 'POS others without interest' nan] ValueCounts: PRODUCT_COMBINATION Cash 17.127 POS household with interest 15.787 POS mobile with interest 13.215 Cash X-Sell: middle 8.616 Cash X-Sell: low 7.800 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 673065, Mode: 365243.0 Median : 365243.0, Variance: 7906075654.949019, Describe: count 997149.000 mean 342209.855 std 88916.116 min -2922.000 25% 365243.000 50% 365243.000 75% 365243.000 max 365243.000 Name: DAYS_FIRST_DRAWING, dtype: float64 ValueCounts: DAYS_FIRST_DRAWING 365243.000 93.712 -228.000 0.012 -224.000 0.012 -212.000 0.012 -223.000 0.012 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 673065, Mode: 365243.0 Median : -831.0, Variance: 5248259146.994171, Describe: count 997149.000 mean 13826.269 std 72444.870 min -2892.000 25% -1628.000 50% -831.000 75% -411.000 max 365243.000 Name: DAYS_FIRST_DUE, dtype: float64 ValueCounts: DAYS_FIRST_DUE 365243.000 4.076 -334.000 0.077 -509.000 0.076 -208.000 0.075 -330.000 0.075 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 673065, Mode: 365243.0 Median : -361.0, Variance: 11418425883.920301, Describe: count 997149.000 mean 33767.774 std 106857.035 min 
-2801.000 25% -1242.000 50% -361.000 75% 129.000 max 365243.000 Name: DAYS_LAST_DUE_1ST_VERSION, dtype: float64 ValueCounts: DAYS_LAST_DUE_1ST_VERSION 365243.000 9.413 9.000 0.072 8.000 0.071 0.000 0.071 5.000 0.070 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 673065, Mode: 365243.0 Median : -537.0, Variance: 22394348853.11387, Describe: count 997149.000 mean 76582.403 std 149647.415 min -2889.000 25% -1314.000 50% -537.000 75% -74.000 max 365243.000 Name: DAYS_LAST_DUE, dtype: float64 ValueCounts: DAYS_LAST_DUE 365243.000 21.182 -245.000 0.066 -188.000 0.065 -239.000 0.064 -167.000 0.064 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 673065, Mode: 365243.0 Median : -499.0, Variance: 23501968241.562744, Describe: count 997149.000 mean 81992.344 std 153303.517 min -2874.000 25% -1270.000 50% -499.000 75% -44.000 max 365243.000 Name: DAYS_TERMINATION, dtype: float64 ValueCounts: DAYS_TERMINATION 365243.000 22.656 -233.000 0.079 -170.000 0.077 -184.000 0.077 -163.000 0.077 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 673065, Mode: 0.0 Unique: [ 0. 1. nan] Median : 0.0, Variance: 0.2219674704465378, Describe: count 997149.000 mean 0.333 std 0.471 min 0.000 25% 0.000 50% 0.000 75% 1.000 max 1.000 Name: NFLAG_INSURED_ON_APPROVAL, dtype: float64 ValueCounts: NFLAG_INSURED_ON_APPROVAL 0.000 66.743 1.000 33.257 Name: proportion, dtype: float64 ------------------------------------------------------------------
NAME_TYPE_SUITE 49.120 DAYS_FIRST_DRAWING 40.298 DAYS_TERMINATION 40.298 DAYS_LAST_DUE 40.298 DAYS_LAST_DUE_1ST_VERSION 40.298 DAYS_FIRST_DUE 40.298 NFLAG_INSURED_ON_APPROVAL 40.298 AMT_GOODS_PRICE 23.082 AMT_ANNUITY 22.287 CNT_PAYMENT 22.286 PRODUCT_COMBINATION 0.021 AMT_CREDIT 0.000 WEEKDAY_APPR_PROCESS_START 0.000 HOUR_APPR_PROCESS_START 0.000 NAME_CONTRACT_TYPE 0.000 AMT_APPLICATION 0.000 NAME_YIELD_GROUP 0.000 NAME_SELLER_INDUSTRY 0.000 SELLERPLACE_AREA 0.000 CHANNEL_TYPE 0.000 NAME_PRODUCT_TYPE 0.000 NAME_PORTFOLIO 0.000 NAME_GOODS_CATEGORY 0.000 NAME_CLIENT_TYPE 0.000 SK_ID_CURR 0.000 CODE_REJECT_REASON 0.000 NAME_PAYMENT_TYPE 0.000 DAYS_DECISION 0.000 NAME_CONTRACT_STATUS 0.000 NAME_CASH_LOAN_PURPOSE 0.000 NFLAG_LAST_APPL_IN_DAY 0.000 FLAG_LAST_APPL_PER_CONTRACT 0.000 SK_ID_PREV 0.000 dtype: float64
prev_appl_data1.select_dtypes(object)
prev_appl_data1.isna().sum()
NAME_CONTRACT_TYPE | WEEKDAY_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | NAME_SELLER_INDUSTRY | NAME_YIELD_GROUP | PRODUCT_COMBINATION | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Consumer loans | SATURDAY | Y | XAP | Approved | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | Connectivity | middle | POS mobile with interest |
1 | Cash loans | THURSDAY | Y | XNA | Approved | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | XNA | low_action | Cash X-Sell: low |
2 | Cash loans | TUESDAY | Y | XNA | Approved | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | XNA | high | Cash X-Sell: high |
3 | Cash loans | MONDAY | Y | XNA | Approved | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | XNA | middle | Cash X-Sell: middle |
4 | Cash loans | THURSDAY | Y | Repairs | Refused | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | XNA | high | Cash Street: high |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1670209 | Consumer loans | WEDNESDAY | Y | XAP | Approved | Cash through the bank | XAP | NaN | Refreshed | Furniture | POS | XNA | Stone | Furniture | low_normal | POS industry with interest |
1670210 | Consumer loans | TUESDAY | Y | XAP | Approved | Cash through the bank | XAP | Unaccompanied | New | Furniture | POS | XNA | Stone | Furniture | middle | POS industry with interest |
1670211 | Consumer loans | MONDAY | Y | XAP | Approved | Cash through the bank | XAP | Spouse, partner | Repeater | Consumer Electronics | POS | XNA | Country-wide | Consumer electronics | low_normal | POS household with interest |
1670212 | Cash loans | WEDNESDAY | Y | XNA | Approved | Cash through the bank | XAP | Family | Repeater | XNA | Cash | x-sell | AP+ (Cash loan) | XNA | low_normal | Cash X-Sell: low |
1670213 | Cash loans | SUNDAY | Y | XNA | Approved | Cash through the bank | XAP | Family | Repeater | XNA | Cash | x-sell | AP+ (Cash loan) | XNA | middle | Cash X-Sell: middle |
1670214 rows × 16 columns
SK_ID_PREV 0 SK_ID_CURR 0 NAME_CONTRACT_TYPE 0 AMT_ANNUITY 372235 AMT_APPLICATION 0 AMT_CREDIT 1 AMT_GOODS_PRICE 385515 WEEKDAY_APPR_PROCESS_START 0 HOUR_APPR_PROCESS_START 0 FLAG_LAST_APPL_PER_CONTRACT 0 NFLAG_LAST_APPL_IN_DAY 0 NAME_CASH_LOAN_PURPOSE 0 NAME_CONTRACT_STATUS 0 DAYS_DECISION 0 NAME_PAYMENT_TYPE 0 CODE_REJECT_REASON 0 NAME_TYPE_SUITE 820405 NAME_CLIENT_TYPE 0 NAME_GOODS_CATEGORY 0 NAME_PORTFOLIO 0 NAME_PRODUCT_TYPE 0 CHANNEL_TYPE 0 SELLERPLACE_AREA 0 NAME_SELLER_INDUSTRY 0 CNT_PAYMENT 372230 NAME_YIELD_GROUP 0 PRODUCT_COMBINATION 346 DAYS_FIRST_DRAWING 673065 DAYS_FIRST_DUE 673065 DAYS_LAST_DUE_1ST_VERSION 673065 DAYS_LAST_DUE 673065 DAYS_TERMINATION 673065 NFLAG_INSURED_ON_APPROVAL 673065 dtype: int64
prev_appl_data1["DAYS_LAST_DUE"] = prev_appl_data1["DAYS_LAST_DUE"].fillna(prev_appl_data1["DAYS_LAST_DUE"].median())
prev_appl_data1["DAYS_FIRST_DUE"] = prev_appl_data1["DAYS_FIRST_DUE"].fillna(prev_appl_data1["DAYS_FIRST_DUE"].median())
prev_appl_data1["DAYS_TERMINATION"] = prev_appl_data1["DAYS_TERMINATION"].fillna(prev_appl_data1["DAYS_TERMINATION"].median())
prev_appl_data1["DAYS_FIRST_DRAWING"] = prev_appl_data1["DAYS_FIRST_DRAWING"].fillna(prev_appl_data1["DAYS_FIRST_DRAWING"].median())
prev_appl_data1["DAYS_LAST_DUE_1ST_VERSION"] = prev_appl_data1["DAYS_LAST_DUE_1ST_VERSION"].fillna(prev_appl_data1["DAYS_LAST_DUE_1ST_VERSION"].median())
prev_appl_data1["NFLAG_INSURED_ON_APPROVAL"] = prev_appl_data1["NFLAG_INSURED_ON_APPROVAL"].fillna(prev_appl_data1["NFLAG_INSURED_ON_APPROVAL"].mode()[0])
prev_appl_data1["NAME_TYPE_SUITE"] = prev_appl_data1["NAME_TYPE_SUITE"].fillna(prev_appl_data1["NAME_TYPE_SUITE"].mode()[0])
- Imputed missing values in the numeric DAYS_* columns with the column median.
- Imputed NFLAG_INSURED_ON_APPROVAL and NAME_TYPE_SUITE with the mode of the column.
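The same imputation can be written as a loop over column lists, which avoids repeating one line per column. This is a sketch under assumptions: `impute` is a hypothetical helper, and it works on an explicit copy so that a frame obtained by slicing another one is not modified in place.

```python
import numpy as np
import pandas as pd

def impute(df, median_cols, mode_cols):
    """Fill numeric columns with their median and categorical ones with their mode."""
    out = df.copy()  # work on a copy rather than on a slice of another frame
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in mode_cols:
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy stand-in for prev_appl_data1 (values are illustrative only).
toy = pd.DataFrame({
    "DAYS_FIRST_DUE": [-800.0, np.nan, -400.0, -600.0],
    "NAME_TYPE_SUITE": ["Family", None, "Unaccompanied", "Unaccompanied"],
})

filled = impute(toy, median_cols=["DAYS_FIRST_DUE"], mode_cols=["NAME_TYPE_SUITE"])
print(filled)
```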
DAYS_COLS = prev_appl_data1.filter(like="DAYS").columns
prev_appl_data1[DAYS_COLS] = (abs(prev_appl_data1[DAYS_COLS])).astype(int)
- DAYS_COLS is the list of columns whose values are expressed as day counts.
- The signs were inconsistent (most values are negative, some are positive), so all values were converted to absolute values.
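In the statistics above, the value 365243 is the maximum of every DAYS_* due/termination column and the mode of several of them, which suggests it is a placeholder rather than a real day count. Treating it as a sentinel is an assumption, but if it holds, it should be set to NaN before taking the median and the absolute value. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # assumed placeholder value, not a real number of days

days = pd.Series(
    [-2922.0, -831.0, -411.0, 365243.0, 365243.0, np.nan],
    name="DAYS_FIRST_DUE",
)

# Median with the placeholder still in the data.
naive_median = days.median()

# Replace the placeholder with NaN, then compute the median on real values only.
cleaned = days.replace(SENTINEL, np.nan)
clean_median = cleaned.median()

# Impute, then convert to positive integer day counts.
result = cleaned.fillna(clean_median).abs().astype(int)

print(naive_median, clean_median)
print(result.tolist())
```

Without this step, the placeholder shifts the median used for imputation and remains in the column after `abs()` as an extreme positive value.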
prev_appl_data1["AMT_ANNUITY"].describe()
prev_appl_data1["AMT_GOODS_PRICE"].describe()
show_stats(prev_appl_data1, ['AMT_ANNUITY','AMT_GOODS_PRICE'])
count 1297979.000 mean 15955.121 std 14782.137 min 0.000 25% 6321.780 50% 11250.000 75% 20658.420 max 418058.145 Name: AMT_ANNUITY, dtype: float64
count 1284699.000 mean 227847.279 std 315396.558 min 0.000 25% 50841.000 50% 112320.000 75% 234000.000 max 6905160.000 Name: AMT_GOODS_PRICE, dtype: float64
Total Nulls: 372235, Mode: 2250.0 Median : 11250.0, Variance: 218511584.1819954, Describe: count 1297979.000 mean 15955.121 std 14782.137 min 0.000 25% 6321.780 50% 11250.000 75% 20658.420 max 418058.145 Name: AMT_ANNUITY, dtype: float64 ValueCounts: AMT_ANNUITY 2250.000 2.455 11250.000 1.077 6750.000 1.036 9000.000 0.963 22500.000 0.917 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 385515, Mode: 45000.0 Median : 112320.0, Variance: 99474988758.27444, Describe: count 1284699.000 mean 227847.279 std 315396.558 min 0.000 25% 50841.000 50% 112320.000 75% 234000.000 max 6905160.000 Name: AMT_GOODS_PRICE, dtype: float64 ValueCounts: AMT_GOODS_PRICE 45000.000 3.723 225000.000 3.390 135000.000 3.165 450000.000 3.030 90000.000 2.286 Name: proportion, dtype: float64 ------------------------------------------------------------------
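- Both columns are strongly right-skewed: the mean is well above the median (15955 vs 11250 for AMT_ANNUITY, 227847 vs 112320 for AMT_GOODS_PRICE) and the maximum is far beyond the 75th percentile. The median is therefore a better imputation value than the mean for these columns.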
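As a rough illustration of why the median is preferred for skewed amounts, the sketch below compares mean and median on a small made-up series and then fills the gaps with the median. The numbers are hypothetical and only chosen to resemble a right-skewed amount column.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed amounts with two gaps; not taken from the dataset.
amounts = pd.Series(
    [2250.0, 6750.0, 9000.0, 11250.0, 22500.0, 418000.0, np.nan, np.nan]
)

mean_value = amounts.mean()      # pulled upward by the single large value
median_value = amounts.median()  # stays close to the bulk of the data

# Fill the gaps with the median, as done for AMT_ANNUITY and AMT_GOODS_PRICE.
filled = amounts.fillna(median_value)

# Share of rows that now sit exactly on the median value.
share_at_median = (filled == median_value).mean()

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print(f"share at median after imputation={share_at_median:.2%}")
```

A side effect worth keeping in mind: every imputed row lands on the same value. In the later statistics for the imputed data, 11250 accounts for 23.123% of AMT_ANNUITY, up from 1.077% of the non-null values before imputation, so the spike at the median in any histogram is an artefact of the fill.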
# prev_appl_data1 = prev_appl_data1[~prev_appl_data1["AMT_ANNUITY"].isna()]
# prev_appl_data1 = prev_appl_data1[~prev_appl_data1["AMT_GOODS_PRICE"].isna()]
prev_appl_data1["AMT_ANNUITY"] = prev_appl_data1["AMT_ANNUITY"].fillna(prev_appl_data1["AMT_ANNUITY"].median())
prev_appl_data1["AMT_GOODS_PRICE"] = prev_appl_data1["AMT_GOODS_PRICE"].fillna(prev_appl_data1["AMT_GOODS_PRICE"].median())
show_stats(prev_appl_data1, prev_appl_data1.select_dtypes(object).columns )
Total Nulls: 0, Mode: Cash loans Unique: ['Consumer loans' 'Cash loans' 'Revolving loans' 'XNA'] ValueCounts: NAME_CONTRACT_TYPE Cash loans 44.758 Consumer loans 43.656 Revolving loans 11.565 XNA 0.021 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: TUESDAY Unique: ['SATURDAY' 'THURSDAY' 'TUESDAY' 'MONDAY' 'FRIDAY' 'SUNDAY' 'WEDNESDAY'] ValueCounts: WEEKDAY_APPR_PROCESS_START TUESDAY 15.275 WEDNESDAY 15.268 MONDAY 15.181 FRIDAY 15.091 THURSDAY 14.914 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Y Unique: ['Y' 'N'] ValueCounts: FLAG_LAST_APPL_PER_CONTRACT Y 99.493 N 0.507 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XAP Unique: ['XAP' 'XNA' 'Repairs' 'Everyday expenses' 'Car repairs' 'Building a house or an annex' 'Other' 'Journey' 'Purchase of electronic equipment' 'Medicine' 'Payments on other loans' 'Urgent needs' 'Buying a used car' 'Buying a new car' 'Buying a holiday home / land' 'Education' 'Buying a home' 'Furniture' 'Buying a garage' 'Business development' 'Wedding / gift / holiday' 'Hobby' 'Gasification / water supply' 'Refusal to name the goal' 'Money for a third person'] ValueCounts: NAME_CASH_LOAN_PURPOSE XAP 55.242 XNA 40.589 Repairs 1.423 Other 0.934 Urgent needs 0.504 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Approved Unique: ['Approved' 'Refused' 'Canceled' 'Unused offer'] ValueCounts: NAME_CONTRACT_STATUS Approved 62.075 Canceled 18.939 Refused 17.404 Unused offer 1.583 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash through the bank Unique: ['Cash through the bank' 'XNA' 'Non-cash from your account' 'Cashless from the account of the employer'] ValueCounts: 
NAME_PAYMENT_TYPE Cash through the bank 61.881 XNA 37.563 Non-cash from your account 0.491 Cashless from the account of the employer 0.065 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XAP Unique: ['XAP' 'HC' 'LIMIT' 'CLIENT' 'SCOFR' 'SCO' 'XNA' 'VERIF' 'SYSTEM'] ValueCounts: CODE_REJECT_REASON XAP 81.013 HC 10.492 LIMIT 3.334 SCO 2.243 CLIENT 1.583 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Unaccompanied Unique: ['Unaccompanied' 'Spouse, partner' 'Family' 'Children' 'Other_B' 'Other_A' 'Group of people'] ValueCounts: NAME_TYPE_SUITE Unaccompanied 79.593 Family 12.769 Spouse, partner 4.016 Children 1.890 Other_B 1.055 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Repeater Unique: ['Repeater' 'New' 'Refreshed' 'XNA'] ValueCounts: NAME_CLIENT_TYPE Repeater 73.719 New 18.043 Refreshed 8.122 XNA 0.116 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['Mobile' 'XNA' 'Consumer Electronics' 'Construction Materials' 'Auto Accessories' 'Photo / Cinema Equipment' 'Computers' 'Audio/Video' 'Medicine' 'Clothing and Accessories' 'Furniture' 'Sport and Leisure' 'Homewares' 'Gardening' 'Jewelry' 'Vehicles' 'Education' 'Medical Supplies' 'Other' 'Direct Sales' 'Office Appliances' 'Fitness' 'Tourism' 'Insurance' 'Additional Service' 'Weapon' 'Animals' 'House Construction'] ValueCounts: NAME_GOODS_CATEGORY XNA 56.927 Mobile 13.454 Consumer Electronics 7.279 Computers 6.333 Audio/Video 5.954 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: POS Unique: ['POS' 'Cash' 'XNA' 'Cards' 'Cars'] ValueCounts: NAME_PORTFOLIO POS 41.373 Cash 27.635 XNA 22.286 Cards 8.681 Cars 0.025 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['XNA' 'x-sell' 'walk-in'] ValueCounts: NAME_PRODUCT_TYPE XNA 63.684 x-sell 27.319 walk-in 8.997 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Credit and cash offices Unique: ['Country-wide' 'Contact center' 'Credit and cash offices' 'Stone' 'Regional / Local' 'AP+ (Cash loan)' 'Channel of corporate sales' 'Car dealer'] ValueCounts: CHANNEL_TYPE Credit and cash offices 43.106 Country-wide 29.618 Stone 12.698 Regional / Local 6.498 Contact center 4.269 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['Connectivity' 'XNA' 'Consumer electronics' 'Industry' 'Clothing' 'Furniture' 'Construction' 'Jewelry' 'Auto technology' 'MLM partners' 'Tourism'] ValueCounts: NAME_SELLER_INDUSTRY XNA 51.234 Consumer electronics 23.845 Connectivity 16.527 Furniture 3.464 Construction 1.783 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['middle' 'low_action' 'high' 'low_normal' 'XNA'] ValueCounts: NAME_YIELD_GROUP XNA 30.967 middle 23.083 high 21.155 low_normal 19.285 low_action 5.511 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 346, Mode: Cash Unique: ['POS mobile with interest' 'Cash X-Sell: low' 'Cash X-Sell: high' 'Cash X-Sell: middle' 'Cash Street: high' 'Cash' 'POS household without interest' 'POS household with interest' 'POS other with interest' 'Card X-Sell' 'POS mobile without interest' 'Card Street' 'POS industry with interest' 'Cash Street: low' 'POS industry without interest' 'Cash Street: middle' 'POS others without interest' nan] ValueCounts: PRODUCT_COMBINATION Cash 17.127 POS household with interest 15.787 POS mobile with interest 13.215 Cash X-Sell: middle 
8.616 Cash X-Sell: low 7.800 Name: proportion, dtype: float64 ------------------------------------------------------------------
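Several of the categorical columns above are dominated by the codes XNA and XAP even though they report zero nulls. A quick way to quantify this is to compute the share of such codes per column. The sketch below does this on a hypothetical miniature frame; treating both codes as placeholders is an assumption made here for illustration.

```python
import pandas as pd

# Hypothetical miniature of the categorical columns; values are illustrative only.
toy = pd.DataFrame({
    "NAME_GOODS_CATEGORY": ["XNA", "Mobile", "XNA", "Computers"],
    "CODE_REJECT_REASON": ["XAP", "XAP", "HC", "XAP"],
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Refused", "Approved"],
})

# Assumption: both codes stand for "not available / not applicable".
PLACEHOLDERS = ["XNA", "XAP"]

# Percentage of rows per column that hold a placeholder code, largest first.
placeholder_pct = (
    toy.isin(PLACEHOLDERS)
       .mean()
       .mul(100)
       .sort_values(ascending=False)
)

print(placeholder_pct)
```

Run against the real frame, a table like this would sit alongside the null percentages and make it clear which columns are only nominally complete.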
Univariate and Outlier analysis - Prev Application
- As part of the univariate analysis, we plot every column with a chart suited to its data type. The inferences are listed below.
prev_desc_cols = prev_appl_data1.describe().columns
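# classify_feature_dtype is this notebook's own helper; dtype_dict is reused by the plotting calls below
prev_dtype_dict = dtype_dict = classify_feature_dtype(prev_appl_data1, prev_desc_cols)
# Note: this rebinds prev_dtype_dict to the 'float_ts' column list only; dtype_dict keeps the full mapping
prev_dtype_dict = dtype_dict['float_ts']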
# univariate_plots(prev_appl_data1.head(60), prev_appl_data1.select_dtypes(object).columns, ftype="category")
# print("--------------------------------------------------------------------------")
# univariate_plots(prev_appl_data1.head(60), dtype_dict['int_cat'], ftype="category")
# print("--------------------------------------------------------------------------")
# univariate_plots(prev_appl_data1.head(60), dtype_dict['float_ts'], ftype="non_categorical")
univariate_plots(prev_appl_data1, prev_appl_data1.select_dtypes(object).columns, ftype="category")
print("--------------------------------------------------------------------------")
univariate_plots(prev_appl_data1, dtype_dict['int_cat'], ftype="category")
print("--------------------------------------------------------------------------")
univariate_plots(prev_appl_data1, dtype_dict['float_ts'], ftype="non_categorical")
NAME_CONTRACT_TYPE - previous applications are split mainly between cash loans (44.8%) and consumer loans (43.7%)
WEEKDAY_APPR_PROCESS_START - some applications start processing on Sundays, which is unexpected and looks like an outlier
NAME_CASH_LOAN_PURPOSE
- The most common value is XAP (55.2%),
- the second most common is XNA (40.6%), so the purpose needs to be captured correctly for better insights
NAME_CONTRACT_STATUS
- The majority of the applications were approved (62.1%)
CODE_REJECT_REASON
- Again, the reject reason is not properly captured: 81.0% of the rows carry XAP,
- which has a large impact on our analysis
NAME_CLIENT_TYPE
- The majority of the clients have reapplied for a loan (73.7% are Repeater), which is a good sign for credit agencies
NAME_GOODS_CATEGORY
- The majority of the values are XNA (56.9%)
NAME_PRODUCT_TYPE
- The majority of the values are XNA (63.7%)
NAME_SELLER_INDUSTRY
- The majority of the values are XNA (51.2%)
NAME_YIELD_GROUP
- XNA is the most common value (31.0%), although it is not a majority
AMT_ANNUITY, AMT_APPLICATION, AMT_CREDIT, CNT_PAYMENT, DAYS_TERMINATION, DAYS_LAST_DUE
- These columns contain outliers
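The outliers noted above were read off the plots. A common numeric counterpart is the interquartile-range (IQR) rule, sketched below on made-up payment counts. The helper name `iqr_outlier_mask` and the multiplier of 1.5 are assumptions for illustration; the notebook itself does not define them.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical payment counts; not taken from the dataset.
cnt_payment = pd.Series([6, 10, 12, 12, 12, 24, 24, 84])

mask = iqr_outlier_mask(cnt_payment)
outliers = cnt_payment[mask]

print(outliers.tolist())
```

The rule is only a screening device: a flagged value may be a legitimate long-term loan rather than an error, so flagged rows are usually inspected before anything is dropped or capped.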
show_stats(prev_appl_data1, prev_appl_data1.columns)
check_cols_null_pct(prev_appl_data1)
Total Nulls: 0, Mode: 1000001 Median : 1923110.5, Variance: 283660585606.6597, Describe: count 1670214.000 mean 1923089.135 std 532597.959 min 1000001.000 25% 1461857.250 50% 1923110.500 75% 2384279.750 max 2845382.000 Name: SK_ID_PREV, dtype: float64 ValueCounts: SK_ID_PREV 2030495 0.000 1035848 0.000 1526498 0.000 2148893 0.000 2437429 0.000 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 187868 Median : 278714.5, Variance: 10570888003.009909, Describe: count 1670214.000 mean 278357.174 std 102814.824 min 100001.000 25% 189329.000 50% 278714.500 75% 367514.000 max 456255.000 Name: SK_ID_CURR, dtype: float64 ValueCounts: SK_ID_CURR 187868 0.005 265681 0.004 173680 0.004 242412 0.004 206783 0.004 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash loans Unique: ['Consumer loans' 'Cash loans' 'Revolving loans' 'XNA'] ValueCounts: NAME_CONTRACT_TYPE Cash loans 44.758 Consumer loans 43.656 Revolving loans 11.565 XNA 0.021 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 11250.0 Median : 11250.0, Variance: 173646877.76482412, Describe: count 1670214.000 mean 14906.506 std 13177.514 min 0.000 25% 7547.096 50% 11250.000 75% 16824.026 max 418058.145 Name: AMT_ANNUITY, dtype: float64 ValueCounts: AMT_ANNUITY 11250.000 23.123 2250.000 1.908 6750.000 0.805 9000.000 0.748 22500.000 0.713 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Median : 71046.0, Variance: 85719989263.5158, Describe: count 1670214.000 mean 175233.860 std 292779.762 min 0.000 25% 18720.000 50% 71046.000 75% 180360.000 max 6905160.000 Name: AMT_APPLICATION, dtype: float64 ValueCounts: AMT_APPLICATION 0.000 23.494 45000.000 2.864 225000.000 2.607 135000.000 2.435 450000.000 2.329 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 1, Mode: 0.0 Median : 80541.0, Variance: 101489786307.34668, Describe: count 1670213.000 mean 196114.021 std 318574.617 min 0.000 25% 24160.500 50% 80541.000 75% 216418.500 max 6905160.000 Name: AMT_CREDIT, dtype: float64 ValueCounts: AMT_CREDIT 0.000 20.163 45000.000 2.099 225000.000 1.263 450000.000 1.195 135000.000 1.121 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 112320.0 Median : 112320.0, Variance: 78883944532.07816, Describe: count 1670214.000 mean 201181.535 std 280862.857 min 0.000 25% 67500.000 50% 112320.000 75% 180405.000 max 6905160.000 Name: AMT_GOODS_PRICE, dtype: float64 ValueCounts: AMT_GOODS_PRICE 112320.000 23.084 45000.000 2.864 225000.000 2.607 135000.000 2.435 450000.000 2.331 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: TUESDAY Unique: ['SATURDAY' 'THURSDAY' 'TUESDAY' 'MONDAY' 'FRIDAY' 'SUNDAY' 'WEDNESDAY'] ValueCounts: WEEKDAY_APPR_PROCESS_START TUESDAY 15.275 WEDNESDAY 15.268 MONDAY 15.181 FRIDAY 15.091 THURSDAY 14.914 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 11 Unique: [15 11 7 9 8 10 12 13 14 16 6 4 5 19 17 18 20 22 21 3 1 2 23 0] Median : 12.0, Variance: 11.11574247090721, Describe: count 1670214.000 mean 12.484 std 3.334 min 0.000 25% 10.000 50% 12.000 75% 15.000 max 23.000 Name: HOUR_APPR_PROCESS_START, dtype: float64 ValueCounts: HOUR_APPR_PROCESS_START 11 11.539 12 11.135 10 10.878 13 10.313 14 9.443 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Y Unique: ['Y' 'N'] ValueCounts: FLAG_LAST_APPL_PER_CONTRACT Y 99.493 N 0.507 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1 
Unique: [1 0] Median : 1.0, Variance: 0.003520005148666215, Describe: count 1670214.000 mean 0.996 std 0.059 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: NFLAG_LAST_APPL_IN_DAY, dtype: float64 ValueCounts: NFLAG_LAST_APPL_IN_DAY 1 99.647 0 0.353 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XAP Unique: ['XAP' 'XNA' 'Repairs' 'Everyday expenses' 'Car repairs' 'Building a house or an annex' 'Other' 'Journey' 'Purchase of electronic equipment' 'Medicine' 'Payments on other loans' 'Urgent needs' 'Buying a used car' 'Buying a new car' 'Buying a holiday home / land' 'Education' 'Buying a home' 'Furniture' 'Buying a garage' 'Business development' 'Wedding / gift / holiday' 'Hobby' 'Gasification / water supply' 'Refusal to name the goal' 'Money for a third person'] ValueCounts: NAME_CASH_LOAN_PURPOSE XAP 55.242 XNA 40.589 Repairs 1.423 Other 0.934 Urgent needs 0.504 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Approved Unique: ['Approved' 'Refused' 'Canceled' 'Unused offer'] ValueCounts: NAME_CONTRACT_STATUS Approved 62.075 Canceled 18.939 Refused 17.404 Unused offer 1.583 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 245 Median : 581.0, Variance: 606996.2907461969, Describe: count 1670214.000 mean 880.680 std 779.100 min 1.000 25% 280.000 50% 581.000 75% 1300.000 max 2922.000 Name: DAYS_DECISION, dtype: float64 ValueCounts: DAYS_DECISION 245 0.146 238 0.143 210 0.142 273 0.141 196 0.139 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash through the bank Unique: ['Cash through the bank' 'XNA' 'Non-cash from your account' 'Cashless from the account of the employer'] ValueCounts: NAME_PAYMENT_TYPE Cash through the bank 61.881 XNA 37.563 Non-cash from your account 
0.491 Cashless from the account of the employer 0.065 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XAP Unique: ['XAP' 'HC' 'LIMIT' 'CLIENT' 'SCOFR' 'SCO' 'XNA' 'VERIF' 'SYSTEM'] ValueCounts: CODE_REJECT_REASON XAP 81.013 HC 10.492 LIMIT 3.334 SCO 2.243 CLIENT 1.583 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Unaccompanied Unique: ['Unaccompanied' 'Spouse, partner' 'Family' 'Children' 'Other_B' 'Other_A' 'Group of people'] ValueCounts: NAME_TYPE_SUITE Unaccompanied 79.593 Family 12.769 Spouse, partner 4.016 Children 1.890 Other_B 1.055 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Repeater Unique: ['Repeater' 'New' 'Refreshed' 'XNA'] ValueCounts: NAME_CLIENT_TYPE Repeater 73.719 New 18.043 Refreshed 8.122 XNA 0.116 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['Mobile' 'XNA' 'Consumer Electronics' 'Construction Materials' 'Auto Accessories' 'Photo / Cinema Equipment' 'Computers' 'Audio/Video' 'Medicine' 'Clothing and Accessories' 'Furniture' 'Sport and Leisure' 'Homewares' 'Gardening' 'Jewelry' 'Vehicles' 'Education' 'Medical Supplies' 'Other' 'Direct Sales' 'Office Appliances' 'Fitness' 'Tourism' 'Insurance' 'Additional Service' 'Weapon' 'Animals' 'House Construction'] ValueCounts: NAME_GOODS_CATEGORY XNA 56.927 Mobile 13.454 Consumer Electronics 7.279 Computers 6.333 Audio/Video 5.954 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: POS Unique: ['POS' 'Cash' 'XNA' 'Cards' 'Cars'] ValueCounts: NAME_PORTFOLIO POS 41.373 Cash 27.635 XNA 22.286 Cards 8.681 Cars 0.025 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, 
Mode: XNA Unique: ['XNA' 'x-sell' 'walk-in'] ValueCounts: NAME_PRODUCT_TYPE XNA 63.684 x-sell 27.319 walk-in 8.997 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Credit and cash offices Unique: ['Country-wide' 'Contact center' 'Credit and cash offices' 'Stone' 'Regional / Local' 'AP+ (Cash loan)' 'Channel of corporate sales' 'Car dealer'] ValueCounts: CHANNEL_TYPE Credit and cash offices 43.106 Country-wide 29.618 Stone 12.698 Regional / Local 6.498 Contact center 4.269 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: -1 Median : 3.0, Variance: 50800450.2636309, Describe: count 1670214.000 mean 313.951 std 7127.443 min -1.000 25% -1.000 50% 3.000 75% 82.000 max 4000000.000 Name: SELLERPLACE_AREA, dtype: float64 ValueCounts: SELLERPLACE_AREA -1 45.663 0 3.624 50 2.239 30 2.061 20 2.026 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['Connectivity' 'XNA' 'Consumer electronics' 'Industry' 'Clothing' 'Furniture' 'Construction' 'Jewelry' 'Auto technology' 'MLM partners' 'Tourism'] ValueCounts: NAME_SELLER_INDUSTRY XNA 51.234 Consumer electronics 23.845 Connectivity 16.527 Furniture 3.464 Construction 1.783 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 372230, Mode: 12.0 Median : 12.0, Variance: 212.2058722166199, Describe: count 1297984.000 mean 16.054 std 14.567 min 0.000 25% 6.000 50% 12.000 75% 24.000 max 84.000 Name: CNT_PAYMENT, dtype: float64 ValueCounts: CNT_PAYMENT 12.000 24.889 6.000 14.674 0.000 11.170 10.000 10.929 24.000 10.614 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['middle' 'low_action' 'high' 'low_normal' 'XNA'] ValueCounts: NAME_YIELD_GROUP XNA 30.967 middle 23.083 
high 21.155 low_normal 19.285 low_action 5.511 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 346, Mode: Cash Unique: ['POS mobile with interest' 'Cash X-Sell: low' 'Cash X-Sell: high' 'Cash X-Sell: middle' 'Cash Street: high' 'Cash' 'POS household without interest' 'POS household with interest' 'POS other with interest' 'Card X-Sell' 'POS mobile without interest' 'Card Street' 'POS industry with interest' 'Cash Street: low' 'POS industry without interest' 'Cash Street: middle' 'POS others without interest' nan] ValueCounts: PRODUCT_COMBINATION Cash 17.127 POS household with interest 15.787 POS mobile with interest 13.215 Cash X-Sell: middle 8.616 Cash X-Sell: low 7.800 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 365243 Median : 365243.0, Variance: 4793060111.321972, Describe: count 1670214.000 mean 351569.514 std 69231.930 min 2.000 25% 365243.000 50% 365243.000 75% 365243.000 max 365243.000 Name: DAYS_FIRST_DRAWING, dtype: float64 ValueCounts: DAYS_FIRST_DRAWING 365243 96.246 228 0.007 224 0.007 212 0.007 223 0.007 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 831 Median : 831.0, Variance: 3150557682.1309857, Describe: count 1670214.000 mean 9856.863 std 56129.829 min 2.000 25% 752.000 50% 831.000 75% 1132.000 max 365243.000 Name: DAYS_FIRST_DUE, dtype: float64 ValueCounts: DAYS_FIRST_DUE 831 40.331 365243 2.434 334 0.046 509 0.046 208 0.045 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 361 Median : 361.0, Variance: 7050978795.539533, Describe: count 1670214.000 mean 21138.662 std 83970.107 min 0.000 25% 361.000 50% 361.000 75% 996.000 max 365243.000 Name: DAYS_LAST_DUE_1ST_VERSION, dtype: float64 ValueCounts: DAYS_LAST_DUE_1ST_VERSION 361 40.339 365243 5.620 2 0.082 1 
0.081 8 0.081 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 537 Median : 537.0, Variance: 14674120897.696123, Describe: count 1670214.000 mean 46875.043 std 121136.786 min 2.000 25% 537.000 50% 537.000 75% 1518.000 max 365243.000 Name: DAYS_LAST_DUE, dtype: float64 ValueCounts: DAYS_LAST_DUE 537 40.328 365243 12.646 245 0.039 188 0.039 239 0.038 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 499 Median : 499.0, Variance: 15539251588.869661, Describe: count 1670214.000 mean 50055.597 std 124656.534 min 2.000 25% 499.000 50% 499.000 75% 1544.000 max 365243.000 Name: DAYS_TERMINATION, dtype: float64 ValueCounts: DAYS_TERMINATION 499 40.330 365243 13.526 233 0.047 170 0.046 184 0.046 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [0. 1.] Median : 0.0, Variance: 0.15912835746123868, Describe: count 1670214.000 mean 0.199 std 0.399 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: NFLAG_INSURED_ON_APPROVAL, dtype: float64 ValueCounts: NFLAG_INSURED_ON_APPROVAL 0.000 80.145 1.000 19.855 Name: proportion, dtype: float64 ------------------------------------------------------------------
CNT_PAYMENT 22.286 PRODUCT_COMBINATION 0.021 AMT_CREDIT 0.000 NAME_YIELD_GROUP 0.000 NAME_PORTFOLIO 0.000 NAME_PRODUCT_TYPE 0.000 CHANNEL_TYPE 0.000 SELLERPLACE_AREA 0.000 NAME_SELLER_INDUSTRY 0.000 SK_ID_PREV 0.000 NAME_CLIENT_TYPE 0.000 DAYS_FIRST_DRAWING 0.000 DAYS_FIRST_DUE 0.000 DAYS_LAST_DUE_1ST_VERSION 0.000 DAYS_LAST_DUE 0.000 DAYS_TERMINATION 0.000 NAME_GOODS_CATEGORY 0.000 NAME_TYPE_SUITE 0.000 SK_ID_CURR 0.000 CODE_REJECT_REASON 0.000 NAME_PAYMENT_TYPE 0.000 DAYS_DECISION 0.000 NAME_CONTRACT_STATUS 0.000 NAME_CASH_LOAN_PURPOSE 0.000 NFLAG_LAST_APPL_IN_DAY 0.000 FLAG_LAST_APPL_PER_CONTRACT 0.000 HOUR_APPR_PROCESS_START 0.000 WEEKDAY_APPR_PROCESS_START 0.000 AMT_GOODS_PRICE 0.000 AMT_APPLICATION 0.000 AMT_ANNUITY 0.000 NAME_CONTRACT_TYPE 0.000 NFLAG_INSURED_ON_APPROVAL 0.000 dtype: float64
prev_appl_data1['CNT_PAYMENT']= prev_appl_data1['CNT_PAYMENT'].fillna(prev_appl_data1['CNT_PAYMENT'].mode()[0])
prev_appl_data1['PRODUCT_COMBINATION'] = prev_appl_data1['PRODUCT_COMBINATION'].fillna(prev_appl_data1['PRODUCT_COMBINATION'].mode()[0])
Data Imbalance
# curr_appl_data1['TARGET'].head(10)
target_ratio_one = (curr_appl_data1['TARGET'] == 1).sum()
target_ratio_zero = (curr_appl_data1['TARGET'] == 0).sum()
target_ratio = target_ratio_zero/target_ratio_one
target_ratio
curr_appl_data1['TARGET'].value_counts(normalize=True).plot.bar()
11.387150050352467
[Figure: bar chart of the normalised TARGET class proportions]
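The ratio of 11.387 means roughly 11.4 on-time clients for every client with payment difficulties, which is the same as a minority share of 1 / (1 + 11.387), or about 8.1%. The conversion can be sketched on hypothetical labels as follows; the toy series is made up and only the arithmetic matters.

```python
import pandas as pd

# Hypothetical labels: 0 = paid on time, 1 = payment difficulties.
target = pd.Series([0] * 23 + [1] * 2, name="TARGET")

counts = target.value_counts()

# Majority rows per minority row, as computed in the cell above.
imbalance_ratio = counts.loc[0] / counts.loc[1]

# The same information expressed as the minority class share.
minority_share = 1 / (1 + imbalance_ratio)

print(f"ratio={imbalance_ratio:.2f}, minority share={minority_share:.2%}")
```

With an imbalance of this size, comparisons between the two classes are easier to read as within-class proportions than as raw counts.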
Merged/Combined Data Analysis
combined_appl_data = pd.merge(left=curr_appl_data1, right=prev_appl_data1, on="SK_ID_CURR", how='inner', suffixes=[None,'_right'])
combined_appl_data.head()
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | YEARS_BIRTH | YEARS_EMPLOYED | YEARS_REGISTRATION | YEARS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | TOTALAREA_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | YEARS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | FLOORSMAX | YEARS_BEGINEXPLOITATION | AMT_CREDIT_BINS | YEARS_EMPLOYED_BINS | AGE_Category | SK_ID_PREV | NAME_CONTRACT_TYPE_right | AMT_ANNUITY_right | AMT_APPLICATION | AMT_CREDIT_right | AMT_GOODS_PRICE_right | WEEKDAY_APPR_PROCESS_START_right | HOUR_APPR_PROCESS_START_right | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE_right 
| NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100002 | 1 | Cash loans | M | No | Yes | 0 | 202500.000 | 406597.500 | 24700.500 | 351000.000 | Unaccompanied | Working | Secondary | Single | House | 0.019 | 25 | 1 | 9 | 5 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.000 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.263 | 0.139 | 0.015 | No | 2.000 | 2.000 | 2.000 | 2.000 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.083 | 0.972 | 400-600k | 0-10 | <30 | 1038818 | Consumer loans | 9251.775 | 179055.000 | 179055.000 | 179055.000 | SATURDAY | 9 | Y | 1 | XAP | Approved | 606 | XNA | XAP | Unaccompanied | New | Vehicles | POS | XNA | Stone | 500 | Auto technology | 24.000 | low_normal | POS other with interest | 365243 | 565 | 125 | 25 | 17 | 0.000 |
1 | 100003 | 0 | Cash loans | F | No | No | 0 | 270000.000 | 1293502.500 | 35698.500 | 1129500.000 | Family | State servant | Higher education | Married | House | 0.004 | 45 | 3 | 3 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.000 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622 | 0.535 | 0.071 | No | 1.000 | 0.000 | 1.000 | 0.000 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.292 | 0.985 | 1M+ | 0-10 | 40-50 | 1810518 | Cash loans | 98356.995 | 900000.000 | 1035882.000 | 900000.000 | FRIDAY | 12 | Y | 1 | XNA | Approved | 746 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.000 | low_normal | Cash X-Sell: low | 365243 | 716 | 386 | 536 | 527 | 1.000 |
2 | 100003 | 0 | Cash loans | F | No | No | 0 | 270000.000 | 1293502.500 | 35698.500 | 1129500.000 | Family | State servant | Higher education | Married | House | 0.004 | 45 | 3 | 3 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.000 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622 | 0.535 | 0.071 | No | 1.000 | 0.000 | 1.000 | 0.000 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.292 | 0.985 | 1M+ | 0-10 | 40-50 | 2636178 | Consumer loans | 64567.665 | 337500.000 | 348637.500 | 337500.000 | SUNDAY | 17 | Y | 1 | XAP | Approved | 828 | Cash through the bank | XAP | Family | Refreshed | Furniture | POS | XNA | Stone | 1400 | Furniture | 6.000 | middle | POS industry with interest | 365243 | 797 | 647 | 647 | 639 | 0.000 |
3 | 100003 | 0 | Cash loans | F | No | No | 0 | 270000.000 | 1293502.500 | 35698.500 | 1129500.000 | Family | State servant | Higher education | Married | House | 0.004 | 45 | 3 | 3 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.000 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622 | 0.535 | 0.071 | No | 1.000 | 0.000 | 1.000 | 0.000 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.292 | 0.985 | 1M+ | 0-10 | 40-50 | 2396755 | Consumer loans | 6737.310 | 68809.500 | 68053.500 | 68809.500 | SATURDAY | 15 | Y | 1 | XAP | Approved | 2341 | Cash through the bank | XAP | Family | Refreshed | Consumer Electronics | POS | XNA | Country-wide | 200 | Consumer electronics | 12.000 | middle | POS household with interest | 365243 | 2310 | 1980 | 1980 | 1976 | 1.000 |
4 | 100004 | 0 | Revolving loans | M | Yes | Yes | 0 | 67500.000 | 135000.000 | 6750.000 | 135000.000 | Unaccompanied | Working | Secondary | Single | House | 0.010 | 52 | 0 | 11 | 6 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.000 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.556 | 0.730 | 0.103 | No | 0.000 | 0.000 | 0.000 | 0.000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.167 | 0.978 | 0-200K | 0-10 | 50-60 | 1564014 | Consumer loans | 5357.250 | 24282.000 | 20106.000 | 24282.000 | FRIDAY | 5 | Y | 1 | XAP | Approved | 815 | Cash through the bank | XAP | Unaccompanied | New | Mobile | POS | XNA | Regional / Local | 30 | Connectivity | 4.000 | middle | POS mobile without interest | 365243 | 784 | 694 | 724 | 714 | 0.000 |
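In the rows above, client 100003 appears three times, once per previous application. The inner merge is one-to-many, so every current-application field, including TARGET, is repeated for each previous application. The sketch below reproduces this on miniature tables. The SK_ID_PREV values echo the output above, while client 100005 and the use of `validate` are additions made here for illustration.

```python
import pandas as pd

# Hypothetical miniature tables; 100005 has no previous application.
current = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004, 100005],
    "TARGET": [1, 0, 0, 0],
})
previous = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100003, 100003, 100004],
    "SK_ID_PREV": [1038818, 1810518, 2636178, 2396755, 1564014],
})

merged = pd.merge(
    current,
    previous,
    on="SK_ID_CURR",
    how="inner",
    validate="one_to_many",  # raises MergeError if SK_ID_CURR repeats in `current`
)

# Number of merged rows contributed by each client.
rows_per_client = merged.groupby("SK_ID_CURR").size()

# TARGET rate per merged row versus per distinct client.
row_level_rate = merged["TARGET"].mean()
client_level_rate = merged.drop_duplicates("SK_ID_CURR")["TARGET"].mean()

print(rows_per_client)
print(row_level_rate, client_level_rate)
```

This is why proportions computed on the merged frame are per previous application rather than per client: the merged statistics report TARGET = 1 for 8.655% of rows, slightly above the roughly 8.1% implied by the imbalance ratio on the current applications alone. Clients without any previous application also drop out of an inner merge.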
show_stats(combined_appl_data, combined_appl_data.columns)
check_cols_null_pct(combined_appl_data)
Total Nulls: 0, Mode: 265681 Median : 278992.0, Variance: 10570263760.57637, Describe: count 1413701.000 mean 278481.288 std 102811.788 min 100002.000 25% 189364.000 50% 278992.000 75% 367556.000 max 456255.000 Name: SK_ID_CURR, dtype: float64 ValueCounts: SK_ID_CURR 265681 0.005 173680 0.005 242412 0.005 206783 0.005 389950 0.005 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [1 0] Median : 0.0, Variance: 0.07906159747213361, Describe: count 1413701.000 mean 0.087 std 0.281 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: TARGET, dtype: float64 ValueCounts: TARGET 0 91.345 1 8.655 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash loans Unique: ['Cash loans' 'Revolving loans'] ValueCounts: NAME_CONTRACT_TYPE Cash loans 92.460 Revolving loans 7.540 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: F Unique: ['M' 'F' 'XNA'] ValueCounts: CODE_GENDER F 67.563 M 32.433 XNA 0.004 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: No Unique: ['No' 'Yes'] ValueCounts: FLAG_OWN_CAR No 66.292 Yes 33.708 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Yes Unique: ['Yes' 'No'] ValueCounts: FLAG_OWN_REALTY Yes 72.440 No 27.560 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [ 0 1 2 3 4 7 5 6 8 9 11 12 10 19 14] Median : 0.0, Variance: 0.5145843604747418, Describe: count 1413701.000 mean 0.405 std 0.717 min 0.000 25% 0.000 50% 0.000 75% 1.000 max 19.000 Name: CNT_CHILDREN, dtype: float64 ValueCounts: CNT_CHILDREN 0 71.080 1 19.003 2 8.519 3 1.220 4 0.136 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 135000.0 Median : 157500.0, Variance: 39431384073.85981, Describe: count 1413701.000 mean 173316.044 std 198573.372 min 25650.000 25% 112500.000 50% 157500.000 75% 207000.000 max 117000000.000 Name: AMT_INCOME_TOTAL, dtype: float64 ValueCounts: AMT_INCOME_TOTAL 135000.000 11.492 112500.000 9.665 157500.000 9.071 180000.000 8.473 225000.000 7.190 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 450000.0 Median : 508495.5, Variance: 148161352000.59268, Describe: count 1413701.000 mean 587553.673 std 384917.331 min 45000.000 25% 270000.000 50% 508495.500 75% 807984.000 max 4050000.000 Name: AMT_CREDIT, dtype: float64 ValueCounts: AMT_CREDIT 450000.000 3.157 225000.000 2.725 675000.000 2.613 270000.000 1.973 900000.000 1.862 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 13500.0 Median : 24930.0, Variance: 194622195.8703764, Describe: count 1413701.000 mean 27017.022 std 13950.706 min 1615.500 25% 16821.000 50% 24930.000 75% 34542.000 max 225000.000 Name: AMT_ANNUITY, dtype: float64 ValueCounts: AMT_ANNUITY 13500.000 1.396 9000.000 1.329 10125.000 0.514 6750.000 0.483 20250.000 0.481 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 225000.0 Median : 450000.0, Variance: 124676554792.54468, Describe: count 1413701.000 mean 527727.717 std 353095.674 min 40500.000 25% 238500.000 50% 450000.000 75% 679500.000 max 4050000.000 Name: AMT_GOODS_PRICE, dtype: float64 ValueCounts: AMT_GOODS_PRICE 225000.000 8.214 450000.000 8.112 675000.000 7.129 900000.000 4.528 270000.000 3.285 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Unaccompanied Unique: ['Unaccompanied' 'Family' 'Spouse' 'Children' 'Others' 
'Group of people'] ValueCounts: NAME_TYPE_SUITE Unaccompanied 81.906 Family 13.034 Spouse 3.300 Children 0.984 Others 0.690 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Working Unique: ['Working' 'State servant' 'Commercial associate' 'Pensioner' 'Unemployed' 'Student' 'Maternity leave'] ValueCounts: NAME_INCOME_TYPE Working 51.219 Commercial associate 22.740 Pensioner 19.349 State servant 6.679 Unemployed 0.009 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Secondary Unique: ['Secondary' 'Higher education' 'Incomplete higher' 'Lower secondary' 'Academic degree'] ValueCounts: NAME_EDUCATION_TYPE Secondary 73.417 Higher education 22.130 Incomplete higher 3.198 Lower secondary 1.214 Academic degree 0.041 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Married Unique: ['Single' 'Married' 'Civil marriage' 'Widow' 'Separated'] ValueCounts: NAME_FAMILY_STATUS Married 64.434 Single 13.203 Civil marriage 10.126 Separated 6.461 Widow 5.775 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: House Unique: ['House' 'Rented apartment' 'With parents' 'Municipal apartment' 'Office apartment' 'Co-op apartment'] ValueCounts: NAME_HOUSING_TYPE House 89.478 With parents 4.358 Municipal apartment 3.645 Rented apartment 1.400 Office apartment 0.808 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.035792 Median : 0.01885, Variance: 0.00017814293869714113, Describe: count 1413701.000 mean 0.021 std 0.013 min 0.000 25% 0.010 50% 0.019 75% 0.029 max 0.073 Name: REGION_POPULATION_RELATIVE, dtype: float64 ValueCounts: REGION_POPULATION_RELATIVE 0.036 5.217 0.025 4.559 0.026 4.087 0.046 3.998 0.029 3.951 Name: proportion, 
dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 39 Median : 43.0, Variance: 141.66155639435718, Describe: count 1413701.000 mean 44.214 std 11.902 min 20.000 25% 34.000 50% 43.000 75% 54.000 max 69.000 Name: YEARS_BIRTH, dtype: float64 ValueCounts: YEARS_BIRTH 39 2.905 38 2.871 40 2.836 37 2.813 43 2.773 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1000 Median : 6.0, Variance: 154164.41352542056, Describe: count 1413701.000 mean 198.550 std 392.638 min 0.000 25% 2.000 50% 6.000 75% 17.000 max 1000.000 Name: YEARS_EMPLOYED, dtype: float64 ValueCounts: YEARS_EMPLOYED 1000 19.352 1 9.566 2 9.120 3 8.048 0 7.594 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Median : 12.0, Variance: 94.58338443246285, Describe: count 1413701.000 mean 13.217 std 9.725 min 0.000 25% 5.000 50% 12.000 75% 20.000 max 67.000 Name: YEARS_REGISTRATION, dtype: float64 ValueCounts: YEARS_REGISTRATION 0 5.594 1 5.059 2 4.859 3 4.202 12 4.154 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 11 Unique: [ 5 0 6 9 1 10 2 8 13 3 7 12 11 4 14 15 17 16 19] Median : 9.0, Variance: 17.0505687512264, Describe: count 1413701.000 mean 7.824 std 4.129 min 0.000 25% 4.000 50% 9.000 75% 11.000 max 19.000 Name: YEARS_ID_PUBLISH, dtype: float64 ValueCounts: YEARS_ID_PUBLISH 11 14.803 12 12.820 10 7.013 13 6.923 9 6.340 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1] Median : 1.0, Variance: 0.0, Describe: count 1413701.000 mean 1.000 std 0.000 min 1.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_MOBIL, dtype: float64 ValueCounts: FLAG_MOBIL 1 100.000 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.15609508527748012, Describe: count 1413701.000 mean 0.806 std 0.395 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_EMP_PHONE, dtype: float64 ValueCounts: FLAG_EMP_PHONE 1 80.644 0 19.356 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.14904043394825833, Describe: count 1413701.000 mean 0.182 std 0.386 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_WORK_PHONE, dtype: float64 ValueCounts: FLAG_WORK_PHONE 0 81.774 1 18.226 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.0007541873628204551, Describe: count 1413701.000 mean 0.999 std 0.027 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_CONT_MOBILE, dtype: float64 ValueCounts: FLAG_CONT_MOBILE 1 99.925 0 0.075 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [1 0] Median : 0.0, Variance: 0.20285697911891543, Describe: count 1413701.000 mean 0.283 std 0.450 min 0.000 25% 0.000 50% 0.000 75% 1.000 max 1.000 Name: FLAG_PHONE, dtype: float64 ValueCounts: FLAG_PHONE 0 71.712 1 28.288 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.06718311789093928, Describe: count 1413701.000 mean 0.072 std 0.259 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_EMAIL, dtype: float64 ValueCounts: FLAG_EMAIL 0 92.757 1 7.243 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Pensioner Unique: ['Laborers' 'Core staff' 'Accountants' 'Managers' 'Pensioner' 'Drivers' 
'Sales staff' 'Cleaning staff' 'Private service staff' 'Medicine staff' 'Security staff' 'Cooking staff' 'High skill tech staff' 'Waiters/barmen staff' 'Low-skill Laborers' 'Realty agents' 'Secretaries' 'IT staff' 'HR staff'] ValueCounts: OCCUPATION_TYPE Pensioner 32.368 Laborers 17.776 Sales staff 10.707 Core staff 8.366 Managers 6.716 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 2.0 Unique: [ 1. 2. 3. 4. 5. 6. 9. 7. 8. 10. 13. 14. 12. 20. 15. 16. 11.] Median : 2.0, Variance: 0.8112221242618761, Describe: count 1413701.000 mean 2.151 std 0.901 min 1.000 25% 2.000 50% 2.000 75% 3.000 max 20.000 Name: CNT_FAM_MEMBERS, dtype: float64 ValueCounts: CNT_FAM_MEMBERS 2.000 52.886 1.000 21.414 3.000 16.493 4.000 7.895 5.000 1.142 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 2 Unique: [2 1 3] Median : 2.0, Variance: 0.24579750875361692, Describe: count 1413701.000 mean 2.060 std 0.496 min 1.000 25% 2.000 50% 2.000 75% 2.000 max 3.000 Name: REGION_RATING_CLIENT, dtype: float64 ValueCounts: REGION_RATING_CLIENT 2 75.065 3 15.450 1 9.486 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 2 Unique: [2 1 3] Median : 2.0, Variance: 0.23877207124135386, Describe: count 1413701.000 mean 2.038 std 0.489 min 1.000 25% 2.000 50% 2.000 75% 2.000 max 3.000 Name: REGION_RATING_CLIENT_W_CITY, dtype: float64 ValueCounts: REGION_RATING_CLIENT_W_CITY 2 75.981 3 13.889 1 10.129 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: TUESDAY Unique: ['WEDNESDAY' 'MONDAY' 'THURSDAY' 'SUNDAY' 'SATURDAY' 'FRIDAY' 'TUESDAY'] ValueCounts: WEEKDAY_APPR_PROCESS_START TUESDAY 17.587 WEDNESDAY 16.708 MONDAY 16.587 THURSDAY 16.289 FRIDAY 16.287 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 10 Unique: [10 11 9 17 16 14 8 15 7 13 6 12 19 3 18 21 4 5 20 22 1 2 23 0] Median : 12.0, Variance: 10.446994996371068, Describe: count 1413701.000 mean 11.984 std 3.232 min 0.000 25% 10.000 50% 12.000 75% 14.000 max 23.000 Name: HOUR_APPR_PROCESS_START, dtype: float64 ValueCounts: HOUR_APPR_PROCESS_START 10 12.541 11 12.148 12 11.256 13 9.989 9 9.249 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.011927518818607137, Describe: count 1413701.000 mean 0.012 std 0.109 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_REGION_NOT_LIVE_REGION, dtype: float64 ValueCounts: REG_REGION_NOT_LIVE_REGION 0 98.793 1 1.207 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.042033808859841176, Describe: count 1413701.000 mean 0.044 std 0.205 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_REGION_NOT_WORK_REGION, dtype: float64 ValueCounts: REG_REGION_NOT_WORK_REGION 0 95.603 1 4.397 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.03474850480857744, Describe: count 1413701.000 mean 0.036 std 0.186 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: LIVE_REGION_NOT_WORK_REGION, dtype: float64 ValueCounts: LIVE_REGION_NOT_WORK_REGION 0 96.395 1 3.605 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.06818767668337616, Describe: count 1413701.000 mean 0.074 std 0.261 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_CITY_NOT_LIVE_CITY, dtype: float64 ValueCounts: REG_CITY_NOT_LIVE_CITY 0 92.639 1 7.361 Name: 
proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.1727984473888169, Describe: count 1413701.000 mean 0.222 std 0.416 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: REG_CITY_NOT_WORK_CITY, dtype: float64 ValueCounts: REG_CITY_NOT_WORK_CITY 0 77.785 1 22.215 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.14376331213040563, Describe: count 1413701.000 mean 0.174 std 0.379 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: LIVE_CITY_NOT_WORK_CITY, dtype: float64 ValueCounts: LIVE_CITY_NOT_WORK_CITY 0 82.594 1 17.406 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Business Entity Type 3 ValueCounts: ORGANIZATION_TYPE Business Entity Type 3 21.567 XNA 19.352 Self-employed 12.952 Other 5.201 Medicine 3.684 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.2858978721410488 Median : 0.5630476928116712, Variance: 0.03648315730123363, Describe: count 1413701.000 mean 0.511 std 0.191 min 0.000 25% 0.389 50% 0.563 75% 0.661 max 0.855 Name: EXT_SOURCE_2, dtype: float64 ValueCounts: EXT_SOURCE_2 0.286 0.199 0.262 0.143 0.566 0.129 0.265 0.120 0.265 0.111 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.5352762504724826 Median : 0.5352762504724826, Variance: 0.03231313722977966, Describe: count 1413701.000 mean 0.497 std 0.180 min 0.001 25% 0.381 50% 0.535 75% 0.621 max 0.896 Name: EXT_SOURCE_3, dtype: float64 ValueCounts: EXT_SOURCE_3 0.535 17.362 0.746 0.406 0.671 0.400 0.555 0.391 0.694 0.367 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 
0.10254666268544127 Median : 0.10254666268544127, Variance: 0.005621824784519625, Describe: count 1413701.000 mean 0.101 std 0.075 min 0.000 25% 0.066 50% 0.103 75% 0.103 max 1.000 Name: TOTALAREA_MODE, dtype: float64 ValueCounts: TOTALAREA_MODE 0.103 47.927 0.000 0.194 0.056 0.082 0.056 0.080 0.055 0.080 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: No Unique: ['No' 'Yes'] ValueCounts: EMERGENCYSTATE_MODE No 99.199 Yes 0.801 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 2. 1. 0. 4. 8. 7. 3. 6. 5. 12. 9. 10. 13. 11. 14. 22. 16. 15. 17. 20. 25. 19. 18. 21. 24. 23. 28. 26. 29. 27. 47. 348. 30.] Median : 0.0, Variance: 6.395559820743489, Describe: count 1413701.000 mean 1.541 std 2.529 min 0.000 25% 0.000 50% 0.000 75% 2.000 max 348.000 Name: OBS_30_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: OBS_30_CNT_SOCIAL_CIRCLE 0.000 52.068 1.000 15.640 2.000 9.778 3.000 6.813 4.000 4.813 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 2. 0. 1. 3. 4. 5. 6. 7. 34. 8.] Median : 0.0, Variance: 0.21662998541204448, Describe: count 1413701.000 mean 0.154 std 0.465 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 34.000 Name: DEF_30_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: DEF_30_CNT_SOCIAL_CIRCLE 0.000 87.858 1.000 9.696 2.000 1.844 3.000 0.459 4.000 0.117 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 2. 1. 0. 4. 8. 7. 3. 6. 5. 12. 9. 10. 13. 11. 14. 21. 15. 22. 16. 20. 25. 17. 19. 18. 24. 23. 28. 29. 27. 47. 344. 30. 26.] 
Median : 0.0, Variance: 6.286009401273656, Describe: count 1413701.000 mean 1.523 std 2.507 min 0.000 25% 0.000 50% 0.000 75% 2.000 max 344.000 Name: OBS_60_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: OBS_60_CNT_SOCIAL_CIRCLE 0.000 52.301 1.000 15.677 2.000 9.776 3.000 6.778 4.000 4.756 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 2. 0. 1. 3. 5. 4. 7. 24. 6.] Median : 0.0, Variance: 0.1433917755687901, Describe: count 1413701.000 mean 0.108 std 0.379 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 24.000 Name: DEF_60_CNT_SOCIAL_CIRCLE, dtype: float64 ValueCounts: DEF_60_CNT_SOCIAL_CIRCLE 0.000 91.042 1.000 7.527 2.000 1.127 3.000 0.229 4.000 0.065 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [ 3 2 1 6 4 0 7 5 8 9 10 11] Median : 2.0, Variance: 4.562196038297019, Describe: count 1413701.000 mean 2.536 std 2.136 min 0.000 25% 1.000 50% 2.000 75% 4.000 max 11.000 Name: YEARS_LAST_PHONE_CHANGE, dtype: float64 ValueCounts: YEARS_LAST_PHONE_CHANGE 0 23.541 4 16.586 1 16.458 2 12.767 3 11.880 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 7.07313619368962e-05, Describe: count 1413701.000 mean 0.000 std 0.008 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_2, dtype: float64 ValueCounts: FLAG_DOCUMENT_2 0 99.993 1 0.007 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.19308926654498165, Describe: count 1413701.000 mean 0.739 std 0.439 min 0.000 25% 0.000 50% 1.000 75% 1.000 max 1.000 Name: FLAG_DOCUMENT_3, dtype: float64 ValueCounts: FLAG_DOCUMENT_3 1 73.856 0 26.144 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 7.638943857510745e-05, Describe: count 1413701.000 mean 0.000 std 0.009 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_4, dtype: float64 ValueCounts: FLAG_DOCUMENT_4 0 99.992 1 0.008 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.01350839549939064, Describe: count 1413701.000 mean 0.014 std 0.116 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_5, dtype: float64 ValueCounts: FLAG_DOCUMENT_5 0 98.630 1 1.370 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.08142636131957906, Describe: count 1413701.000 mean 0.089 std 0.285 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_6, dtype: float64 ValueCounts: FLAG_DOCUMENT_6 0 91.058 1 8.942 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0002128711269718906, Describe: count 1413701.000 mean 0.000 std 0.015 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_7, dtype: float64 ValueCounts: FLAG_DOCUMENT_7 0 99.979 1 0.021 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.06888615057472505, Describe: count 1413701.000 mean 0.074 std 0.262 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_8, dtype: float64 ValueCounts: FLAG_DOCUMENT_8 0 92.557 1 7.443 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.003172301239052705, Describe: 
count 1413701.000 mean 0.003 std 0.056 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_9, dtype: float64 ValueCounts: FLAG_DOCUMENT_9 0 99.682 1 0.318 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 5.587860603198018e-05, Describe: count 1413701.000 mean 0.000 std 0.007 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_10, dtype: float64 ValueCounts: FLAG_DOCUMENT_10 0 99.994 1 0.006 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.001699725391659478, Describe: count 1413701.000 mean 0.002 std 0.041 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_11, dtype: float64 ValueCounts: FLAG_DOCUMENT_11 0 99.830 1 0.170 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 1.414725309908651e-06, Describe: count 1413701.000 mean 0.000 std 0.001 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_12, dtype: float64 ValueCounts: FLAG_DOCUMENT_12 0 100.000 1 0.000 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.001780788588738822, Describe: count 1413701.000 mean 0.002 std 0.042 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_13, dtype: float64 ValueCounts: FLAG_DOCUMENT_13 0 99.822 1 0.178 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0014268329032509246, Describe: count 1413701.000 mean 0.001 std 0.038 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_14, dtype: float64 ValueCounts: 
FLAG_DOCUMENT_14 0 99.857 1 0.143 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0006567090023151083, Describe: count 1413701.000 mean 0.001 std 0.026 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_15, dtype: float64 ValueCounts: FLAG_DOCUMENT_15 0 99.934 1 0.066 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.005970150089362391, Describe: count 1413701.000 mean 0.006 std 0.077 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_16, dtype: float64 ValueCounts: FLAG_DOCUMENT_16 0 99.399 1 0.601 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00014215987961972622, Describe: count 1413701.000 mean 0.000 std 0.012 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_17, dtype: float64 ValueCounts: FLAG_DOCUMENT_17 0 99.986 1 0.014 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0041784741202988175, Describe: count 1413701.000 mean 0.004 std 0.065 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_18, dtype: float64 ValueCounts: FLAG_DOCUMENT_18 0 99.580 1 0.420 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.0003987939383243309, Describe: count 1413701.000 mean 0.000 std 0.020 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_19, dtype: float64 ValueCounts: FLAG_DOCUMENT_19 0 99.960 1 0.040 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, 
Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00036698695901663507, Describe: count 1413701.000 mean 0.000 std 0.019 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_20, dtype: float64 ValueCounts: FLAG_DOCUMENT_20 0 99.963 1 0.037 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0 Unique: [0 1] Median : 0.0, Variance: 0.00022418399786824068, Describe: count 1413701.000 mean 0.000 std 0.015 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: FLAG_DOCUMENT_21, dtype: float64 ValueCounts: FLAG_DOCUMENT_21 0 99.978 1 0.022 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [0. 1. 2. 3. 4.] Median : 0.0, Variance: 0.005932991534707245, Describe: count 1413701.000 mean 0.005 std 0.077 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 4.000 Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_HOUR 0.000 99.473 1.000 0.508 2.000 0.017 3.000 0.002 4.000 0.000 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [0. 1. 3. 2. 4. 5. 6. 9.] Median : 0.0, Variance: 0.010039349307693738, Describe: count 1413701.000 mean 0.006 std 0.100 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 9.000 Name: AMT_REQ_CREDIT_BUREAU_DAY, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_DAY 0.000 99.495 1.000 0.455 2.000 0.027 3.000 0.012 4.000 0.006 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [0. 1. 3. 2. 4. 5. 6. 8. 7.] 
Median : 0.0, Variance: 0.04051775033978972, Describe: count 1413701.000 mean 0.034 std 0.201 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 8.000 Name: AMT_REQ_CREDIT_BUREAU_WEEK, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_WEEK 0.000 96.782 1.000 3.110 2.000 0.070 3.000 0.017 4.000 0.011 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 0. 1. 2. 5. 3. 7. 9. 4. 11. 8. 6. 16. 12. 14. 10. 13. 17. 24. 19. 15. 23. 18. 27. 22.] Median : 0.0, Variance: 0.8590375176640723, Describe: count 1413701.000 mean 0.266 std 0.927 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 27.000 Name: AMT_REQ_CREDIT_BUREAU_MON, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_MON 0.000 83.704 1.000 12.312 2.000 2.036 3.000 0.737 4.000 0.393 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 0. 1. 2. 4. 3. 8. 5. 6. 7. 261.] Median : 0.0, Variance: 0.7711375890103405, Describe: count 1413701.000 mean 0.320 std 0.878 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 261.000 Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_QRT 0.000 78.751 1.000 12.597 2.000 7.243 3.000 0.977 4.000 0.347 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [ 1. 0. 2. 4. 5. 3. 8. 6. 9. 7. 10. 11. 13. 16. 12. 25. 23. 15. 14. 22. 17. 19. 18. 21. 20.] 
Median : 2.0, Variance: 5.147256027837513, Describe: count 1413701.000 mean 2.575 std 2.269 min 0.000 25% 1.000 50% 2.000 75% 4.000 max 25.000 Name: AMT_REQ_CREDIT_BUREAU_YEAR, dtype: float64 ValueCounts: AMT_REQ_CREDIT_BUREAU_YEAR 0.000 22.540 2.000 16.713 1.000 15.606 3.000 14.889 4.000 10.984 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.1667 Median : 0.1667, Variance: 0.010712743464371973, Describe: count 1413701.000 mean 0.194 std 0.104 min 0.000 25% 0.167 50% 0.167 75% 0.167 max 1.000 Name: FLOORSMAX, dtype: float64 ValueCounts: FLOORSMAX 0.167 69.894 0.333 10.404 0.042 4.874 0.375 2.500 0.125 2.266 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.9775174983915752 Median : 0.9775174983915752, Variance: 0.002071626816730406, Describe: count 1413701.000 mean 0.977 std 0.046 min 0.000 25% 0.978 50% 0.978 75% 0.982 max 1.000 Name: YEARS_BEGINEXPLOITATION, dtype: float64 ValueCounts: YEARS_BEGINEXPLOITATION 0.978 48.432 0.987 1.319 0.986 1.270 0.980 1.261 0.987 1.256 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 200-400k Unique: ['400-600k', '1M+', '0-200K', '200-400k', '600-800k', '800-1M'] Categories (6, object): ['0-200K' < '200-400k' < '400-600k' < '600-800k' < '800-1M' < '1M+'] ValueCounts: AMT_CREDIT_BINS 200-400k 26.851 400-600k 22.229 1M+ 15.415 600-800k 14.226 0-200K 11.473 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0-10 Unique: ['0-10', '60+', '20-30', '10-20', '30-40', '40-50'] Categories (7, object): ['0-10' < '10-20' < '20-30' < '30-40' < '40-50' < '50-60' < '60+'] ValueCounts: YEARS_EMPLOYED_BINS 0-10 66.419 60+ 19.352 10-20 10.768 20-30 2.647 30-40 0.758 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 30-40 Unique: ['<30', '40-50', '50-60', '30-40', '60+'] Categories (5, object): ['<30' < '30-40' < '40-50' < '50-60' < '60+'] ValueCounts: AGE_Category 30-40 26.579 40-50 24.657 50-60 22.668 <30 15.256 60+ 10.840 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 1000001 Median : 1922698.0, Variance: 283785628415.63196, Describe: count 1413701.000 mean 1922744.331 std 532715.335 min 1000001.000 25% 1461346.000 50% 1922698.000 75% 2384012.000 max 2845381.000 Name: SK_ID_PREV, dtype: float64 ValueCounts: SK_ID_PREV 1038818 0.000 2734700 0.000 2622697 0.000 1634932 0.000 2177866 0.000 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash loans Unique: ['Consumer loans' 'Cash loans' 'Revolving loans' 'XNA'] ValueCounts: NAME_CONTRACT_TYPE_right Cash loans 44.335 Consumer loans 44.228 Revolving loans 11.415 XNA 0.022 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 11250.0 Median : 11250.0, Variance: 173283112.1855973, Describe: count 1413701.000 mean 14840.322 std 13163.704 min 0.000 25% 7406.055 50% 11250.000 75% 16747.965 max 418058.145 Name: AMT_ANNUITY_right, dtype: float64 ValueCounts: AMT_ANNUITY_right 11250.000 22.583 2250.000 1.865 6750.000 0.818 9000.000 0.750 22500.000 0.718 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Median : 70870.5, Variance: 86213992304.8011, Describe: count 1413701.000 mean 175243.594 std 293622.193 min 0.000 25% 19750.500 50% 70870.500 75% 180000.000 max 5850000.000 Name: AMT_APPLICATION, dtype: float64 ValueCounts: AMT_APPLICATION 0.000 23.011 45000.000 2.836 225000.000 2.612 135000.000 2.469 450000.000 2.320 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 1, Mode: 0.0 Median : 80595.0, Variance: 102068269559.04918, Describe: count 1413700.000 mean 196354.086 std 319481.251 min 0.000 25% 24880.500 50% 80595.000 75% 215640.000 max 4509688.500 Name: AMT_CREDIT_right, dtype: float64 ValueCounts: AMT_CREDIT_right 0.000 19.564 45000.000 2.060 225000.000 1.272 450000.000 1.193 135000.000 1.142 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 112320.0 Median : 112320.0, Variance: 79534684972.55249, Describe: count 1413701.000 mean 200655.235 std 282018.944 min 0.000 25% 66375.000 50% 112320.000 75% 180000.000 max 5850000.000 Name: AMT_GOODS_PRICE_right, dtype: float64 ValueCounts: AMT_GOODS_PRICE_right 112320.000 22.604 45000.000 2.836 225000.000 2.612 135000.000 2.468 450000.000 2.321 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: WEDNESDAY Unique: ['SATURDAY' 'FRIDAY' 'SUNDAY' 'THURSDAY' 'TUESDAY' 'MONDAY' 'WEDNESDAY'] ValueCounts: WEEKDAY_APPR_PROCESS_START_right WEDNESDAY 15.245 TUESDAY 15.212 MONDAY 15.174 FRIDAY 15.093 THURSDAY 14.926 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 11 Unique: [ 9 12 17 15 5 14 13 11 8 10 18 16 19 7 6 4 21 20 3 2 22 1 23 0] Median : 12.0, Variance: 11.098717981459602, Describe: count 1413701.000 mean 12.479 std 3.331 min 0.000 25% 10.000 50% 12.000 75% 15.000 max 23.000 Name: HOUR_APPR_PROCESS_START_right, dtype: float64 ValueCounts: HOUR_APPR_PROCESS_START_right 11 11.549 12 11.143 10 10.879 13 10.316 14 9.452 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Y Unique: ['Y' 'N'] ValueCounts: FLAG_LAST_APPL_PER_CONTRACT Y 99.483 N 0.517 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: 1 Unique: [1 0] Median : 1.0, Variance: 0.003555212075985266, Describe: count 1413701.000 mean 0.996 std 0.060 min 0.000 25% 1.000 50% 1.000 75% 1.000 max 1.000 Name: NFLAG_LAST_APPL_IN_DAY, dtype: float64 ValueCounts: NFLAG_LAST_APPL_IN_DAY 1 99.643 0 0.357 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XAP Unique: ['XAP' 'XNA' 'Other' 'Payments on other loans' 'Buying a used car' 'Repairs' 'Education' 'Buying a new car' 'Everyday expenses' 'Medicine' 'Car repairs' 'Urgent needs' 'Buying a holiday home / land' 'Building a house or an annex' 'Furniture' 'Journey' 'Purchase of electronic equipment' 'Wedding / gift / holiday' 'Buying a home' 'Business development' 'Gasification / water supply' 'Buying a garage' 'Hobby' 'Money for a third person' 'Refusal to name the goal'] ValueCounts: NAME_CASH_LOAN_PURPOSE XAP 55.665 XNA 40.132 Repairs 1.423 Other 0.950 Urgent needs 0.512 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Approved Unique: ['Approved' 'Canceled' 'Refused' 'Unused offer'] ValueCounts: NAME_CONTRACT_STATUS Approved 62.679 Canceled 18.352 Refused 17.358 Unused offer 1.611 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 245 Median : 582.0, Variance: 613935.1991857325, Describe: count 1413701.000 mean 880.367 std 783.540 min 1.000 25% 271.000 50% 582.000 75% 1313.000 max 2922.000 Name: DAYS_DECISION, dtype: float64 ValueCounts: DAYS_DECISION 245 0.154 238 0.153 210 0.149 224 0.146 252 0.145 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash through the bank Unique: ['XNA' 'Cash through the bank' 'Non-cash from your account' 'Cashless from the account of the employer'] 
ValueCounts: NAME_PAYMENT_TYPE Cash through the bank 62.439 XNA 36.992 Non-cash from your account 0.502 Cashless from the account of the employer 0.067 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XAP Unique: ['XAP' 'LIMIT' 'HC' 'SCO' 'SCOFR' 'VERIF' 'CLIENT' 'XNA' 'SYSTEM'] ValueCounts: CODE_REJECT_REASON XAP 81.031 HC 10.326 LIMIT 3.379 SCO 2.309 CLIENT 1.611 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Unaccompanied Unique: ['Unaccompanied' 'Family' 'Spouse, partner' 'Children' 'Other_A' 'Other_B' 'Group of people'] ValueCounts: NAME_TYPE_SUITE_right Unaccompanied 79.388 Family 12.866 Spouse, partner 4.073 Children 1.917 Other_B 1.066 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Repeater Unique: ['New' 'Repeater' 'Refreshed' 'XNA'] ValueCounts: NAME_CLIENT_TYPE Repeater 73.402 New 18.359 Refreshed 8.130 XNA 0.109 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['Vehicles' 'XNA' 'Furniture' 'Consumer Electronics' 'Mobile' 'Audio/Video' 'Construction Materials' 'Gardening' 'Photo / Cinema Equipment' 'Computers' 'Clothing and Accessories' 'Homewares' 'Medical Supplies' 'Other' 'Jewelry' 'Office Appliances' 'Tourism' 'Auto Accessories' 'Sport and Leisure' 'Medicine' 'Weapon' 'Direct Sales' 'Fitness' 'Insurance' 'Additional Service' 'Education' 'Animals'] ValueCounts: NAME_GOODS_CATEGORY XNA 56.392 Mobile 13.705 Consumer Electronics 7.412 Computers 6.361 Audio/Video 6.048 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: POS Unique: ['POS' 'Cash' 'XNA' 'Cards' 'Cars'] ValueCounts: NAME_PORTFOLIO POS 41.908 Cash 27.705 XNA 21.731 Cards 8.629 Cars 0.027 Name: proportion, dtype: float64 
------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['XNA' 'x-sell' 'walk-in'] ValueCounts: NAME_PRODUCT_TYPE XNA 63.666 x-sell 27.261 walk-in 9.072 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Credit and cash offices Unique: ['Stone' 'Credit and cash offices' 'Country-wide' 'Regional / Local' 'AP+ (Cash loan)' 'Contact center' 'Channel of corporate sales' 'Car dealer'] ValueCounts: CHANNEL_TYPE Credit and cash offices 42.466 Country-wide 29.926 Stone 12.981 Regional / Local 6.541 Contact center 4.166 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: -1 Median : 4.0, Variance: 59214281.81824192, Describe: count 1413701.000 mean 314.988 std 7695.082 min -1.000 25% -1.000 50% 4.000 75% 85.000 max 4000000.000 Name: SELLERPLACE_AREA, dtype: float64 ValueCounts: SELLERPLACE_AREA -1 45.030 0 3.585 50 2.242 30 2.100 20 2.057 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['Auto technology' 'XNA' 'Furniture' 'Consumer electronics' 'Connectivity' 'Construction' 'Clothing' 'Industry' 'Tourism' 'Jewelry' 'MLM partners'] ValueCounts: NAME_SELLER_INDUSTRY XNA 50.599 Consumer electronics 24.171 Connectivity 16.860 Furniture 3.464 Construction 1.798 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 12.0 Unique: [24. 12. 6. 4. 48. 0. 18. 10. 30. 8. 16. 36. 5. 60. 42. 14. 54. 20. 3. 15. 72. 9. 7. 11. 22. 17. 84. 13. 29. 59. 53. 40. 23. 26. 28. 32. 34. 38. 19. 66. 35. 33. 39. 44. 21. 47. 41. 45.] 
Median : 12.0, Variance: 168.87129923304957, Describe: count 1413701.000 mean 15.171 std 12.995 min 0.000 25% 10.000 50% 12.000 75% 18.000 max 84.000 Name: CNT_PAYMENT, dtype: float64 ValueCounts: CNT_PAYMENT 12.000 41.202 6.000 11.513 10.000 8.668 0.000 8.629 24.000 8.260 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: XNA Unique: ['low_normal' 'middle' 'XNA' 'high' 'low_action'] ValueCounts: NAME_YIELD_GROUP XNA 30.360 middle 22.987 high 21.650 low_normal 19.444 low_action 5.559 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: Cash Unique: ['POS other with interest' 'Cash X-Sell: low' 'POS industry with interest' 'POS household with interest' 'POS mobile without interest' 'Card Street' 'Card X-Sell' 'Cash X-Sell: high' 'Cash' 'Cash Street: high' 'Cash X-Sell: middle' 'POS mobile with interest' 'POS household without interest' 'POS industry without interest' 'Cash Street: low' 'Cash Street: middle' 'POS others without interest'] ValueCounts: PRODUCT_COMBINATION Cash 16.652 POS household with interest 16.019 POS mobile with interest 13.500 Cash X-Sell: middle 8.491 Cash X-Sell: low 7.823 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 365243 Median : 365243.0, Variance: 4829581043.088714, Describe: count 1413701.000 mean 351460.352 std 69495.187 min 2.000 25% 365243.000 50% 365243.000 75% 365243.000 max 365243.000 Name: DAYS_FIRST_DRAWING, dtype: float64 ValueCounts: DAYS_FIRST_DRAWING 365243 96.216 228 0.008 224 0.008 212 0.008 220 0.008 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 831 Median : 831.0, Variance: 3111196020.1185093, Describe: count 1413701.000 mean 9743.652 std 55778.096 min 2.000 25% 738.000 50% 831.000 75% 1146.000 max 365243.000 Name: DAYS_FIRST_DUE, 
dtype: float64 ValueCounts: DAYS_FIRST_DUE 831 39.724 365243 2.402 334 0.047 299 0.046 208 0.046 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 361 Median : 361.0, Variance: 7021672901.056603, Describe: count 1413701.000 mean 21052.263 std 83795.423 min 0.000 25% 361.000 50% 361.000 75% 1010.000 max 365243.000 Name: DAYS_LAST_DUE_1ST_VERSION, dtype: float64 ValueCounts: DAYS_LAST_DUE_1ST_VERSION 361 39.733 365243 5.595 1 0.082 8 0.081 10 0.081 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 537 Median : 537.0, Variance: 14814814480.760283, Describe: count 1413701.000 mean 47395.172 std 121716.123 min 2.000 25% 537.000 50% 537.000 75% 1542.000 max 365243.000 Name: DAYS_LAST_DUE, dtype: float64 ValueCounts: DAYS_LAST_DUE 537 39.720 365243 12.789 239 0.040 188 0.040 245 0.040 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 499 Median : 499.0, Variance: 15729348068.356016, Describe: count 1413701.000 mean 50774.791 std 125416.698 min 2.000 25% 499.000 50% 499.000 75% 1570.000 max 365243.000 Name: DAYS_TERMINATION, dtype: float64 ValueCounts: DAYS_TERMINATION 499 39.721 365243 13.723 156 0.046 233 0.046 170 0.045 Name: proportion, dtype: float64 ------------------------------------------------------------------ Total Nulls: 0, Mode: 0.0 Unique: [0. 1.] Median : 0.0, Variance: 0.15961934771683145, Describe: count 1413701.000 mean 0.199 std 0.400 min 0.000 25% 0.000 50% 0.000 75% 0.000 max 1.000 Name: NFLAG_INSURED_ON_APPROVAL, dtype: float64 ValueCounts: NFLAG_INSURED_ON_APPROVAL 0.000 80.063 1.000 19.937 Name: proportion, dtype: float64 ------------------------------------------------------------------
AMT_CREDIT_right 0.000 TARGET 0.000 AMT_ANNUITY_right 0.000 NAME_CONTRACT_TYPE_right 0.000 SK_ID_PREV 0.000 AGE_Category 0.000 YEARS_EMPLOYED_BINS 0.000 AMT_CREDIT_BINS 0.000 YEARS_BEGINEXPLOITATION 0.000 FLOORSMAX 0.000 AMT_REQ_CREDIT_BUREAU_YEAR 0.000 AMT_REQ_CREDIT_BUREAU_QRT 0.000 AMT_REQ_CREDIT_BUREAU_MON 0.000 AMT_REQ_CREDIT_BUREAU_WEEK 0.000 AMT_REQ_CREDIT_BUREAU_DAY 0.000 AMT_REQ_CREDIT_BUREAU_HOUR 0.000 FLAG_DOCUMENT_21 0.000 FLAG_DOCUMENT_20 0.000 FLAG_DOCUMENT_19 0.000 FLAG_DOCUMENT_18 0.000 FLAG_DOCUMENT_17 0.000 FLAG_DOCUMENT_16 0.000 FLAG_DOCUMENT_15 0.000 FLAG_DOCUMENT_14 0.000 FLAG_DOCUMENT_13 0.000 FLAG_DOCUMENT_12 0.000 FLAG_DOCUMENT_11 0.000 AMT_APPLICATION 0.000 SK_ID_CURR 0.000 AMT_GOODS_PRICE_right 0.000 NAME_PRODUCT_TYPE 0.000 DAYS_TERMINATION 0.000 DAYS_LAST_DUE 0.000 DAYS_LAST_DUE_1ST_VERSION 0.000 DAYS_FIRST_DUE 0.000 DAYS_FIRST_DRAWING 0.000 PRODUCT_COMBINATION 0.000 NAME_YIELD_GROUP 0.000 CNT_PAYMENT 0.000 NAME_SELLER_INDUSTRY 0.000 SELLERPLACE_AREA 0.000 CHANNEL_TYPE 0.000 NAME_PORTFOLIO 0.000 WEEKDAY_APPR_PROCESS_START_right 0.000 NAME_GOODS_CATEGORY 0.000 NAME_CLIENT_TYPE 0.000 NAME_TYPE_SUITE_right 0.000 CODE_REJECT_REASON 0.000 NAME_PAYMENT_TYPE 0.000 DAYS_DECISION 0.000 NAME_CONTRACT_STATUS 0.000 NAME_CASH_LOAN_PURPOSE 0.000 NFLAG_LAST_APPL_IN_DAY 0.000 FLAG_LAST_APPL_PER_CONTRACT 0.000 HOUR_APPR_PROCESS_START_right 0.000 FLAG_DOCUMENT_10 0.000 FLAG_DOCUMENT_9 0.000 FLAG_DOCUMENT_8 0.000 NAME_FAMILY_STATUS 0.000 FLAG_PHONE 0.000 FLAG_CONT_MOBILE 0.000 FLAG_WORK_PHONE 0.000 FLAG_EMP_PHONE 0.000 FLAG_MOBIL 0.000 YEARS_ID_PUBLISH 0.000 YEARS_REGISTRATION 0.000 YEARS_EMPLOYED 0.000 YEARS_BIRTH 0.000 REGION_POPULATION_RELATIVE 0.000 NAME_HOUSING_TYPE 0.000 NAME_EDUCATION_TYPE 0.000 OCCUPATION_TYPE 0.000 NAME_INCOME_TYPE 0.000 NAME_TYPE_SUITE 0.000 AMT_GOODS_PRICE 0.000 AMT_ANNUITY 0.000 AMT_CREDIT 0.000 AMT_INCOME_TOTAL 0.000 CNT_CHILDREN 0.000 FLAG_OWN_REALTY 0.000 FLAG_OWN_CAR 0.000 CODE_GENDER 0.000 NAME_CONTRACT_TYPE 0.000 
FLAG_EMAIL 0.000 CNT_FAM_MEMBERS 0.000 FLAG_DOCUMENT_7 0.000 TOTALAREA_MODE 0.000 FLAG_DOCUMENT_6 0.000 FLAG_DOCUMENT_5 0.000 FLAG_DOCUMENT_4 0.000 FLAG_DOCUMENT_3 0.000 FLAG_DOCUMENT_2 0.000 YEARS_LAST_PHONE_CHANGE 0.000 DEF_60_CNT_SOCIAL_CIRCLE 0.000 OBS_60_CNT_SOCIAL_CIRCLE 0.000 DEF_30_CNT_SOCIAL_CIRCLE 0.000 OBS_30_CNT_SOCIAL_CIRCLE 0.000 EMERGENCYSTATE_MODE 0.000 EXT_SOURCE_3 0.000 REGION_RATING_CLIENT 0.000 EXT_SOURCE_2 0.000 ORGANIZATION_TYPE 0.000 LIVE_CITY_NOT_WORK_CITY 0.000 REG_CITY_NOT_WORK_CITY 0.000 REG_CITY_NOT_LIVE_CITY 0.000 LIVE_REGION_NOT_WORK_REGION 0.000 REG_REGION_NOT_WORK_REGION 0.000 REG_REGION_NOT_LIVE_REGION 0.000 HOUR_APPR_PROCESS_START 0.000 WEEKDAY_APPR_PROCESS_START 0.000 REGION_RATING_CLIENT_W_CITY 0.000 NFLAG_INSURED_ON_APPROVAL 0.000 dtype: float64
# combined_desc_cols = combined_appl_data.describe().columns
# dtype_dict = classify_feature_dtype(combined_appl_data, combined_desc_cols)
combined_appl_data.iloc[(combined_appl_data.SK_ID_CURR.value_counts() > 1).index, :].sort_index()
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | YEARS_BIRTH | YEARS_EMPLOYED | YEARS_REGISTRATION | YEARS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | TOTALAREA_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | YEARS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | FLOORSMAX | YEARS_BEGINEXPLOITATION | AMT_CREDIT_BINS | YEARS_EMPLOYED_BINS | AGE_Category | SK_ID_PREV | NAME_CONTRACT_TYPE_right | AMT_ANNUITY_right | AMT_APPLICATION | AMT_CREDIT_right | AMT_GOODS_PRICE_right | WEEKDAY_APPR_PROCESS_START_right | HOUR_APPR_PROCESS_START_right | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE_right 
| NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100002 | 125444 | 1 | Cash loans | F | No | Yes | 0 | 180000.000 | 1350000.000 | 39465.000 | 1350000.000 | Family | Pensioner | Secondary | Married | House | 0.031 | 55 | 1000 | 29 | 10 | 1 | 0 | 0 | 1 | 0 | 0 | Pensioner | 2.000 | 2 | 2 | SATURDAY | 8 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.722 | 0.521 | 0.103 | No | 1.000 | 0.000 | 1.000 | 0.000 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.167 | 0.978 | 1M+ | 60+ | 50-60 | 1678168 | Consumer loans | 13229.460 | 53950.500 | 46435.500 | 53950.500 | SUNDAY | 15 | Y | 1 | XAP | Approved | 2512 | Cash through the bank | XAP | Family | Repeater | Mobile | POS | XNA | Country-wide | 212 | Connectivity | 4.000 | low_normal | POS mobile with interest | 365243 | 2481 | 2391 | 2391 | 2386 | 1.000 |
100003 | 125445 | 0 | Cash loans | F | No | Yes | 1 | 72000.000 | 354276.000 | 20466.000 | 292500.000 | Unaccompanied | Working | Secondary | Civil marriage | House | 0.009 | 43 | 3 | 17 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | Cooking staff | 3.000 | 2 | 2 | WEDNESDAY | 13 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 3 | 0.559 | 0.585 | 0.083 | No | 1.000 | 0.000 | 1.000 | 0.000 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 2.000 | 0.167 | 0.972 | 200-400k | 0-10 | 40-50 | 1217532 | Consumer loans | 2358.990 | 18391.500 | 20214.000 | 18391.500 | WEDNESDAY | 10 | Y | 1 | XAP | Approved | 2443 | Cash through the bank | XAP | Other_B | New | Mobile | POS | XNA | Country-wide | 61 | Connectivity | 12.000 | high | POS mobile with interest | 365243 | 2412 | 2082 | 2112 | 2110 | 1.000 |
100004 | 125445 | 0 | Cash loans | F | No | Yes | 1 | 72000.000 | 354276.000 | 20466.000 | 292500.000 | Unaccompanied | Working | Secondary | Civil marriage | House | 0.009 | 43 | 3 | 17 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | Cooking staff | 3.000 | 2 | 2 | WEDNESDAY | 13 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 3 | 0.559 | 0.585 | 0.083 | No | 1.000 | 0.000 | 1.000 | 0.000 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 2.000 | 0.167 | 0.972 | 200-400k | 0-10 | 40-50 | 1443027 | Cash loans | 10332.720 | 90000.000 | 95940.000 | 90000.000 | WEDNESDAY | 13 | Y | 1 | XNA | Approved | 1022 | Cash through the bank | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.000 | middle | Cash X-Sell: middle | 365243 | 992 | 662 | 752 | 744 | 1.000 |
100006 | 125445 | 0 | Cash loans | F | No | Yes | 1 | 72000.000 | 354276.000 | 20466.000 | 292500.000 | Unaccompanied | Working | Secondary | Civil marriage | House | 0.009 | 43 | 3 | 17 | 10 | 1 | 1 | 0 | 1 | 0 | 0 | Cooking staff | 3.000 | 2 | 2 | WEDNESDAY | 13 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 3 | 0.559 | 0.585 | 0.083 | No | 1.000 | 0.000 | 1.000 | 0.000 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 2.000 | 0.167 | 0.972 | 200-400k | 0-10 | 40-50 | 2833540 | Consumer loans | 3276.180 | 14580.000 | 15300.000 | 14580.000 | SATURDAY | 12 | Y | 1 | XAP | Approved | 1390 | Cash through the bank | XAP | Children | Repeater | Photo / Cinema Equipment | POS | XNA | Country-wide | 51 | Connectivity | 6.000 | high | POS mobile with interest | 365243 | 1359 | 1209 | 1209 | 1200 | 0.000 |
100007 | 125446 | 0 | Cash loans | F | No | No | 0 | 225000.000 | 1687266.000 | 62658.000 | 1575000.000 | Family | State servant | Secondary | Married | House | 0.014 | 59 | 6 | 27 | 11 | 1 | 1 | 0 | 1 | 0 | 0 | Pensioner | 2.000 | 2 | 2 | SATURDAY | 5 | 0 | 0 | 0 | 0 | 0 | 0 | Other | 0.675 | 0.691 | 0.103 | No | 1.000 | 1.000 | 1.000 | 1.000 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.167 | 0.978 | 1M+ | 0-10 | 50-60 | 1201611 | Consumer loans | 22873.815 | 249714.000 | 249714.000 | 249714.000 | MONDAY | 12 | Y | 1 | XAP | Approved | 1447 | XNA | XAP | Family | New | Audio/Video | POS | XNA | Regional / Local | 134 | Consumer electronics | 12.000 | low_action | POS household without interest | 365243 | 1416 | 1086 | 1086 | 1083 | 0.000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
456251 | 215478 | 0 | Cash loans | F | No | Yes | 0 | 144000.000 | 948582.000 | 27864.000 | 679500.000 | Unaccompanied | Working | Secondary | Separated | House | 0.025 | 59 | 39 | 10 | 13 | 1 | 1 | 0 | 1 | 0 | 0 | Pensioner | 1.000 | 2 | 2 | SATURDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Industry: type 7 | 0.192 | 0.700 | 0.103 | No | 0.000 | 0.000 | 0.000 | 0.000 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 5.000 | 0.167 | 0.978 | 800-1M | 30-40 | 50-60 | 2103396 | Consumer loans | 20482.110 | 221985.000 | 199786.500 | 221985.000 | SATURDAY | 13 | Y | 1 | XAP | Approved | 427 | Cash through the bank | XAP | Unaccompanied | Refreshed | Furniture | POS | XNA | Stone | 50 | Furniture | 12.000 | middle | POS industry with interest | 365243 | 393 | 63 | 333 | 328 | 0.000 |
456252 | 215478 | 0 | Cash loans | F | No | Yes | 0 | 144000.000 | 948582.000 | 27864.000 | 679500.000 | Unaccompanied | Working | Secondary | Separated | House | 0.025 | 59 | 39 | 10 | 13 | 1 | 1 | 0 | 1 | 0 | 0 | Pensioner | 1.000 | 2 | 2 | SATURDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Industry: type 7 | 0.192 | 0.700 | 0.103 | No | 0.000 | 0.000 | 0.000 | 0.000 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 5.000 | 0.167 | 0.978 | 800-1M | 30-40 | 50-60 | 2396415 | Cash loans | 44178.750 | 454500.000 | 472500.000 | 454500.000 | THURSDAY | 15 | Y | 1 | XNA | Approved | 324 | Cash through the bank | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.000 | low_normal | Cash X-Sell: low | 365243 | 831 | 361 | 537 | 499 | 0.000 |
456253 | 215478 | 0 | Cash loans | F | No | Yes | 0 | 144000.000 | 948582.000 | 27864.000 | 679500.000 | Unaccompanied | Working | Secondary | Separated | House | 0.025 | 59 | 39 | 10 | 13 | 1 | 1 | 0 | 1 | 0 | 0 | Pensioner | 1.000 | 2 | 2 | SATURDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Industry: type 7 | 0.192 | 0.700 | 0.103 | No | 0.000 | 0.000 | 0.000 | 0.000 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 5.000 | 0.167 | 0.978 | 800-1M | 30-40 | 50-60 | 1263974 | Consumer loans | 22313.070 | 139455.000 | 125509.500 | 139455.000 | SATURDAY | 13 | Y | 1 | XAP | Approved | 336 | XNA | XAP | Unaccompanied | Repeater | Audio/Video | POS | XNA | Country-wide | 261 | Consumer electronics | 6.000 | low_normal | POS household with interest | 365243 | 305 | 155 | 185 | 181 | 0.000 |
456254 | 215478 | 0 | Cash loans | F | No | Yes | 0 | 144000.000 | 948582.000 | 27864.000 | 679500.000 | Unaccompanied | Working | Secondary | Separated | House | 0.025 | 59 | 39 | 10 | 13 | 1 | 1 | 0 | 1 | 0 | 0 | Pensioner | 1.000 | 2 | 2 | SATURDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Industry: type 7 | 0.192 | 0.700 | 0.103 | No | 0.000 | 0.000 | 0.000 | 0.000 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 5.000 | 0.167 | 0.978 | 800-1M | 30-40 | 50-60 | 2511765 | Cash loans | 11250.000 | 0.000 | 0.000 | 112320.000 | THURSDAY | 15 | Y | 1 | XNA | Canceled | 324 | XNA | XAP | Unaccompanied | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | 12.000 | XNA | Cash | 365243 | 831 | 361 | 537 | 499 | 0.000 |
456255 | 215478 | 0 | Cash loans | F | No | Yes | 0 | 144000.000 | 948582.000 | 27864.000 | 679500.000 | Unaccompanied | Working | Secondary | Separated | House | 0.025 | 59 | 39 | 10 | 13 | 1 | 1 | 0 | 1 | 0 | 0 | Pensioner | 1.000 | 2 | 2 | SATURDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Industry: type 7 | 0.192 | 0.700 | 0.103 | No | 0.000 | 0.000 | 0.000 | 0.000 | 7 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 5.000 | 0.167 | 0.978 | 800-1M | 30-40 | 50-60 | 2351113 | Cash loans | 11250.000 | 0.000 | 0.000 | 112320.000 | WEDNESDAY | 14 | Y | 1 | XNA | Canceled | 164 | XNA | XAP | Unaccompanied | Repeater | XNA | XNA | XNA | Credit and cash offices | -1 | XNA | 12.000 | XNA | Cash | 365243 | 831 | 361 | 537 | 499 | 0.000 |
291057 rows × 112 columns
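Note that `(combined_appl_data.SK_ID_CURR.value_counts() > 1).index` holds every distinct `SK_ID_CURR`, not only the repeated ones, and `iloc` then reads those IDs as row positions. This is why the index labels in the table above differ from the `SK_ID_CURR` column. If the goal is to list applicants that appear more than once in the merged data, a minimal sketch with `DataFrame.duplicated` could look like the following. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for combined_appl_data (values are illustrative only).
toy = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100003, 100004, 100003],
    "SK_ID_PREV": [1, 2, 3, 4, 5],
})

# keep=False marks every row whose SK_ID_CURR occurs more than once.
repeat_mask = toy.duplicated(subset="SK_ID_CURR", keep=False)
repeated_rows = toy.loc[repeat_mask].sort_values(["SK_ID_CURR", "SK_ID_PREV"])

print(repeated_rows)
```

Selecting with a boolean mask and `loc` avoids mixing ID values with row positions.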
comb_target_train_0 = combined_appl_data.loc[combined_appl_data['TARGET'] == 0]
comb_target_train_1 = combined_appl_data.loc[combined_appl_data['TARGET'] == 1]
combined_desc_cols = combined_appl_data.describe().columns
FLAG_COLS = combined_appl_data.filter(like="FLAG").columns
combined_desc_cols = list(set(combined_desc_cols) - set(FLAG_COLS))
combined_dtype_dict = classify_feature_dtype(combined_appl_data, combined_desc_cols)
combined_dtype_floats = combined_dtype_dict['float_ts']
combined_desc_cols
['YEARS_EMPLOYED', 'FLOORSMAX', 'YEARS_BEGINEXPLOITATION', 'SELLERPLACE_AREA', 'REG_REGION_NOT_WORK_REGION', 'DAYS_TERMINATION', 'YEARS_ID_PUBLISH', 'DAYS_FIRST_DRAWING', 'CNT_FAM_MEMBERS', 'DEF_30_CNT_SOCIAL_CIRCLE', 'CNT_CHILDREN', 'CNT_PAYMENT', 'AMT_GOODS_PRICE', 'AMT_ANNUITY', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_ANNUITY_right', 'DAYS_LAST_DUE', 'AMT_GOODS_PRICE_right', 'OBS_60_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_CREDIT', 'DEF_60_CNT_SOCIAL_CIRCLE', 'HOUR_APPR_PROCESS_START', 'REG_CITY_NOT_LIVE_CITY', 'LIVE_REGION_NOT_WORK_REGION', 'SK_ID_CURR', 'AMT_INCOME_TOTAL', 'HOUR_APPR_PROCESS_START_right', 'OBS_30_CNT_SOCIAL_CIRCLE', 'REGION_POPULATION_RELATIVE', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_CREDIT_right', 'SK_ID_PREV', 'REG_CITY_NOT_WORK_CITY', 'DAYS_LAST_DUE_1ST_VERSION', 'YEARS_LAST_PHONE_CHANGE', 'TARGET', 'REGION_RATING_CLIENT_W_CITY', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'YEARS_BIRTH', 'REG_REGION_NOT_LIVE_REGION', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_APPLICATION', 'EXT_SOURCE_3', 'YEARS_REGISTRATION', 'TOTALAREA_MODE', 'DAYS_DECISION', 'EXT_SOURCE_2', 'LIVE_CITY_NOT_WORK_CITY', 'REGION_RATING_CLIENT', 'DAYS_FIRST_DUE']
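Because the column names pass through `set`, the list above comes back in arbitrary order. If a stable order is preferred (for example, so that plots appear in the same sequence on every run), one option is to filter with a list comprehension instead. The sketch below uses a small made-up frame rather than `combined_appl_data`.

```python
import pandas as pd

# Toy frame with a mix of numeric, FLAG and text columns (illustrative only).
toy = pd.DataFrame({
    "AMT_CREDIT": [1.0, 2.0],
    "FLAG_MOBIL": [1, 0],
    "CNT_CHILDREN": [0, 1],
    "FLAG_EMAIL": [0, 0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans"],
})

numeric_cols = toy.describe().columns            # numeric columns, in frame order
flag_cols = set(toy.filter(like="FLAG").columns)  # set gives fast membership checks

# Keep the original column order while dropping the FLAG columns.
ordered_cols = [c for c in numeric_cols if c not in flag_cols]
print(ordered_cols)
```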
# univariate_plots(combined_appl_data.head(60), combined_appl_data.select_dtypes(include=['category','object']).columns, t0 = comb_target_train_0, t1 = comb_target_train_1, ftype="category")
# print("--------------------------------------------------------------------------")
# univariate_plots(combined_appl_data.head(60), combined_dtype_dict['int_cat'], t0 = comb_target_train_0, t1 = comb_target_train_1, ftype="category")
# print("--------------------------------------------------------------------------")
# univariate_plots(combined_appl_data.head(60), combined_dtype_dict['float_ts'], ftype="non_categorical")
univariate_plots(combined_appl_data, combined_appl_data.select_dtypes(include=['category','object']).columns, t0 = comb_target_train_0, t1 = comb_target_train_1, ftype="category")
print("--------------------------------------------------------------------------")
univariate_plots(combined_appl_data, combined_dtype_dict['int_cat'], t0 = comb_target_train_0, t1 = comb_target_train_1, ftype="category")
print("--------------------------------------------------------------------------")
univariate_plots(combined_appl_data, combined_dtype_dict['float_ts'], ftype="non_categorical")
Bivariate Analysis¶
plt.figure(figsize=(8,5))
sns.countplot(x='NAME_EDUCATION_TYPE', hue='NAME_CONTRACT_STATUS', data=comb_target_train_0, palette='viridis', stat="percent")
plt.xticks(rotation=0)
plt.show()
plt.figure(figsize=(8,5))
grouped_df = (curr_appl_data1.groupby(["NAME_CONTRACT_TYPE"])["TARGET"].value_counts(normalize=True)*100).reset_index()
sns.barplot(x='NAME_CONTRACT_TYPE', y='proportion', hue='TARGET', data=grouped_df, palette='magma')
plt.xticks(rotation=0)
plt.show()
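The per-category TARGET percentages plotted above can also be computed with a row-normalised crosstab, which avoids depending on the `proportion` column name that newer pandas versions give to `value_counts(normalize=True)`. The following is a minimal sketch on invented toy data; the frame `toy` and the column name `percent` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for curr_appl_data1 (values invented for illustration).
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 4 + ["Revolving loans"] * 2,
    "TARGET": [0, 0, 0, 1, 0, 1],
})

# Row-normalised crosstab: the TARGET shares of each contract type sum to 100.
share = pd.crosstab(toy["NAME_CONTRACT_TYPE"], toy["TARGET"], normalize="index") * 100

# Long format, one row per (contract type, TARGET) pair, suitable for
# sns.barplot(x="NAME_CONTRACT_TYPE", y="percent", hue="TARGET", data=share_long).
share_long = share.reset_index().melt(
    id_vars="NAME_CONTRACT_TYPE", var_name="TARGET", value_name="percent"
)
print(share_long)
```

Normalising within each category makes the two contract types comparable even though one has far more applications than the other.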
# sns.pairplot(curr_appl_data1[curr_dtype_dict])
# sns.pairplot(prev_appl_data1[prev_dtype_dict])
bivariate_plots(df=combined_appl_data, col='NAME_EDUCATION_TYPE', hue='NAME_CONTRACT_STATUS', train0=comb_target_train_0, train1=comb_target_train_1 )
bivariate_plots(df=combined_appl_data, col='NAME_CLIENT_TYPE', hue='NAME_CONTRACT_STATUS', train0=comb_target_train_0, train1=comb_target_train_1 )
Clients with Secondary education account for most of the loan applications. Among the given categories, they also have the greatest chance of paying their installments on time without difficulties.
Most loan applicants are Repeaters. Repeaters also make up most of the applicants who have no difficulty paying their installments.
sk_id_tg = combined_appl_data.groupby(by=['SK_ID_CURR'])['TARGET'].value_counts() # number of rows per SK_ID_CURR, split by TARGET
sk_id_tg_df = sk_id_tg[sk_id_tg > 2].reset_index() # keep only IDs with more than 2 rows
# the last bin is open-ended so that counts above 70 (e.g. 72 and 73) are not dropped as NaN
sk_id_tg_df['count_bins'] = pd.cut(sk_id_tg_df['count'], bins=[0,10,20,30,40,50,60,float('inf')], labels=['<=10','10-20','20-30','30-40','40-50','50-60','60+'])
fig, axes = plt.subplots(1, 2, figsize=(20, 6));
sns.barplot(x='count_bins', y='count', hue='TARGET', data=sk_id_tg_df, palette='magma', ax=axes[0]);
axes[0].set_title("Barplot Analysis of SK_ID's that have more than 2 applications wrt TARGET"); axes[0].tick_params(axis='x', rotation=45);
axes[0].plot([5.8], [70], marker='o', color='red', markersize=10, linestyle='--', linewidth=2)
sns.countplot(x='count_bins', hue='TARGET', data=sk_id_tg_df, palette='magma', ax=axes[1])
axes[1].set_title("Countplot Analysis of SK_ID's that have more than 2 applications wrt TARGET"); axes[1].tick_params(axis='x', rotation=45);
fig.subplots_adjust(wspace=0.5, hspace=0.3);
plt.tight_layout();
plt.subplots_adjust(top=0.85);
plt.show();
plt.clf();
sk_id_tg[sk_id_tg > 60]
sk_id_tg[sk_id_tg > 60].count()
SK_ID_CURR  TARGET
173680      0         72
198355      0         63
206783      0         67
238250      0         61
242412      0         68
265681      0         73
280586      0         61
345161      0         62
382179      0         64
389950      0         64
446486      0         62
Name: count, dtype: int64
11
- This analysis uses the combined data (application.csv, prev_application.csv).
- In the graph, the 60+ category of count_bins was very surprising to see.
- The 11 SK_ID_CURR values in that category all have TARGET 0, i.e. no payment difficulties.
- Further analysis may reveal the reason for this outcome.
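The counting and binning step behind this observation can be sketched on toy data as follows. The frame `toy`, the column name `n_rows`, and the open-ended last bin are assumptions made for illustration; the row counts are invented and do not come from the real data set.

```python
import pandas as pd

# Toy stand-in for combined_appl_data: one row per merged application record.
toy = pd.DataFrame({
    "SK_ID_CURR": [1] * 3 + [2] * 12 + [3] * 65,
    "TARGET": [1] * 3 + [0] * 12 + [0] * 65,
})

# Number of rows per (SK_ID_CURR, TARGET) pair.
counts = toy.groupby(["SK_ID_CURR", "TARGET"]).size().rename("n_rows").reset_index()

# Right-closed bins: (0, 10], (10, 20], ..., (60, inf).
# The open-ended last bin keeps very large counts instead of turning them into NaN.
bins = [0, 10, 20, 30, 40, 50, 60, float("inf")]
labels = ["<=10", "10-20", "20-30", "30-40", "40-50", "50-60", "60+"]
counts["count_bins"] = pd.cut(counts["n_rows"], bins=bins, labels=labels)

# IDs that land in the top bin, together with their TARGET value.
print(counts[counts["count_bins"] == "60+"])
```

Filtering on the top bin gives the same kind of listing as `sk_id_tg[sk_id_tg > 60]`, with the bin label attached.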
Correlation Matrix¶
import numpy as np
corr = comb_target_train_0[comb_target_train_0.select_dtypes(exclude=['category','object']).columns].corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True)
import numpy as np
corr = comb_target_train_1[comb_target_train_1.select_dtypes(exclude=['category','object']).columns].corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True)
Heatmap correlation of all numerical features:
- The correlations are more or less similar for both targets.
- AMT_CREDIT is high for younger clients ('YEARS_BIRTH').
- AMT_CREDIT is high for clients with low 'CNT_CHILDREN'.
Top 10 Correlations¶
corr_0 = curr_target_train_0.corr(numeric_only=True).abs()
corr_0 = corr_0.unstack()
correlation_0 = corr_0.dropna() # drop pairs with undefined correlation; sorting is done when the top 10 are selected
# correlation_0
correlation_0 = correlation_0[correlation_0 != 1.0]
correlation_target_zero = correlation_0.reset_index()
correlation_target_zero.sort_values(by=0, ascending=False).head(10)
index | level_0 | level_1 | 0 |
---|---|---|---|
507 | YEARS_EMPLOYED | FLAG_EMP_PHONE | 1.000 |
752 | FLAG_EMP_PHONE | YEARS_EMPLOYED | 1.000 |
2014 | OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 0.999 |
1891 | OBS_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | 0.999 |
313 | AMT_GOODS_PRICE | AMT_CREDIT | 0.987 |
190 | AMT_CREDIT | AMT_GOODS_PRICE | 0.987 |
1196 | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | 0.950 |
1134 | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | 0.950 |
78 | CNT_CHILDREN | CNT_FAM_MEMBERS | 0.879 |
1055 | CNT_FAM_MEMBERS | CNT_CHILDREN | 0.879 |
corr_1 = curr_target_train_1.corr(numeric_only=True).abs()
corr_1 = corr_1.unstack()
correlation_1 = corr_1.dropna() # drop pairs with undefined correlation; sorting is done when the top 10 are selected
# correlation_1
correlation_1 = correlation_1[correlation_1 != 1.0]
correlation_target_one = correlation_1.reset_index()
correlation_target_one.sort_values(by=0, ascending=False).head(10)
index | level_0 | level_1 | 0 |
---|---|---|---|
474 | YEARS_EMPLOYED | FLAG_EMP_PHONE | 1.000 |
646 | FLAG_EMP_PHONE | YEARS_EMPLOYED | 1.000 |
1712 | OBS_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | 0.998 |
1827 | OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 0.998 |
293 | AMT_GOODS_PRICE | AMT_CREDIT | 0.983 |
178 | AMT_CREDIT | AMT_GOODS_PRICE | 0.983 |
1003 | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | 0.957 |
1061 | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | 0.957 |
929 | CNT_FAM_MEMBERS | CNT_CHILDREN | 0.885 |
73 | CNT_CHILDREN | CNT_FAM_MEMBERS | 0.885 |
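Because `unstack()` returns both orderings of every pair, each correlation appears twice in the tables above, so the top 10 rows contain only 5 distinct pairs. One way to list each pair once is to read only the strict upper triangle of the correlation matrix. The sketch below uses random toy data; the helper name `top_correlations` and the column names `feature_a`, `feature_b`, `abs_corr` are hypothetical, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=10):
    """Return the n strongest absolute correlations, each pair listed once."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns.to_numpy()
    # Index pairs of the strict upper triangle (k=1 skips the diagonal),
    # so self-correlations and mirrored duplicates are excluded.
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "feature_a": cols[i],
        "feature_b": cols[j],
        "abs_corr": corr.to_numpy()[i, j],
    })
    return (pairs.dropna()
                 .sort_values("abs_corr", ascending=False)
                 .head(n)
                 .reset_index(drop=True))

# Toy data: "b" is almost a multiple of "a", "c" is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
top = top_correlations(toy, n=3)
print(top)
```

Applied to `curr_target_train_0` and `curr_target_train_1`, a call such as `top_correlations(curr_target_train_0, n=10)` would return 10 distinct pairs instead of 5 pairs shown twice.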
a = combined_appl_data[["AMT_INCOME_TOTAL","AMT_CREDIT","AMT_ANNUITY","AMT_GOODS_PRICE"]].corr()
sns.heatmap(a, annot=True, cmap="viridis")
Heatmap correlation of critical features
# curr_appl_data1.to_csv('curr.csv')
# prev_appl_data1.to_csv('prev.csv')
pairplot = sns.pairplot(curr_appl_data1[['AMT_CREDIT','AMT_ANNUITY', 'AMT_GOODS_PRICE', 'CNT_FAM_MEMBERS', 'CNT_CHILDREN', 'YEARS_REGISTRATION', 'YEARS_BIRTH', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']])
plt.show();
# Rotate the x-axis labels
# for ax in pairplot.axes.flat:
# ax.set_xlabel(ax.get_xlabel(), rotation=90, labelpad=10)
# ax.set_ylabel(ax.get_ylabel(), rotation=0, ha='right', labelpad=10)
Some of the strong linear relationships observed are listed below:
- 'AMT_CREDIT', 'AMT_ANNUITY' and 'AMT_GOODS_PRICE': the higher the price of the goods, the higher the credit amount.
- 'CNT_FAM_MEMBERS' and 'CNT_CHILDREN'.
- 'YEARS_REGISTRATION' and 'YEARS_BIRTH'.
- 'AMT_CREDIT' vs 'AMT_REQ_CREDIT_BUREAU_YEAR'.