
# Project: EDA and preprocessing for box office revenue prediction

Introduction

I watch movies whenever my favourite actors are in them, even if the film is not a hit. I just love movie stuff, so I'm truly excited to be on this project, even though I've got to predict the revenue, haha. In this project, we will explore the data (EDA) and train a model. Let's get started! The data comes from Kaggle.

 

 

Basically, the data is quite tidy, with few missing values, and I believe some columns are not important for the target variable.

The columns include title, budget, language, cast, revenue, release date, runtime, and so forth.

 

EDA

We will look at the data with respect to the revenue column, so the relationships to examine are:

- revenue vs budget, popularity, production_companies, genres, release_date, runtime, cast

First of all, let's see the correlation with the revenue.

I do not think any feature is highly correlated with the target variable except budget!
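As a rough sketch of how such a correlation check could be done with pandas (the helper name `revenue_correlations` and the toy numbers are made up for illustration and are not the Kaggle data):

```python
import pandas as pd


def revenue_correlations(df: pd.DataFrame, target: str = "revenue") -> pd.Series:
    """Pearson correlation of each numeric column with the target, highest first."""
    numeric = df.select_dtypes(include="number")  # ignore text columns such as title
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


# Toy data, invented for this example only
toy = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "popularity": [5.0, 3.0, 8.0, 2.0, 7.0],
    "runtime": [90, 120, 100, 150, 110],
    "revenue": [25, 38, 65, 79, 105],
    "title": list("abcde"),
})

corr = revenue_correlations(toy)
print(corr)
```

On the real data you would pass the training DataFrame instead of `toy`. Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless.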

 

Visualization

revenue vs budget (left), revenue vs popularity (right).

Top 11 companies by number of released movies.

revenue VS Top 11 companies.

revenue vs production_companies.

The red graph shows that some companies have released few movies but still earned pretty good money.
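A minimal sketch of how the per-company comparison could be computed. It assumes `production_companies` has already been parsed into a list of company names per movie (the raw column may need parsing first); the function name `company_stats` and the toy rows are hypothetical.

```python
import pandas as pd


def company_stats(df: pd.DataFrame, top_n: int = 11) -> pd.DataFrame:
    """Movie count and mean revenue per company, keeping the top_n by movie count."""
    # One row per (movie, company) pair
    exploded = df.explode("production_companies").dropna(subset=["production_companies"])
    stats = exploded.groupby("production_companies")["revenue"].agg(
        movies="count", mean_revenue="mean"
    )
    return stats.sort_values("movies", ascending=False).head(top_n)


# Toy data, invented for this example only
toy = pd.DataFrame({
    "production_companies": [["A", "B"], ["A"], ["C"]],
    "revenue": [100, 200, 900],
})

stats = company_stats(toy)
print(stats)
```

Sorting the same table by `mean_revenue` instead of `movies` surfaces companies with few releases but high earnings.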

 

What happened in 2017?! Maybe not all of that year's movies are counted.

Friday is the hottest day!

July, August, and September are pretty busy, but overall it looks quite flat.
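The year, weekday, and month views can be derived from `release_date` roughly like this. This is a sketch that assumes the dates parse with `pd.to_datetime`; the helper name `add_date_features`, the new column names, and the toy rows are assumptions for illustration.

```python
import pandas as pd


def add_date_features(df: pd.DataFrame, col: str = "release_date") -> pd.DataFrame:
    """Return a copy with year, month, and weekday columns derived from the date column."""
    out = df.copy()
    dates = pd.to_datetime(out[col], errors="coerce")  # unparseable dates become NaT
    out["release_year"] = dates.dt.year
    out["release_month"] = dates.dt.month
    out["release_weekday"] = dates.dt.day_name()
    return out


# Toy data, invented for this example only
toy = pd.DataFrame({
    "release_date": ["2015-07-17", "2016-08-05", "2017-01-06"],
    "revenue": [100, 200, 50],
})

feat = add_date_features(toy)
by_weekday = feat.groupby("release_weekday")["revenue"].agg(["count", "mean"])
print(by_weekday)
```

The same `groupby` on `release_year` or `release_month` gives the per-year and per-month counts.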

revenue vs runtime

 

Normally, a movie's runtime is around 100 ~ 150 min.

 

I initially assumed the two movies with runtimes over 240 min were outliers, but the values are real: Cleopatra (1963 film) and Carlos.
Carlos is over 330 min... insane.
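A quick way to list such long movies before deciding whether they are errors, as a sketch: the 240 min threshold comes from the text above, while the function name `long_runtime_rows`, the titles, and the runtimes in the toy data are placeholders.

```python
import pandas as pd


def long_runtime_rows(df: pd.DataFrame, threshold: float = 240) -> pd.DataFrame:
    """Rows whose runtime exceeds the threshold (in minutes), longest first."""
    mask = df["runtime"] > threshold
    return df.loc[mask, ["title", "runtime"]].sort_values("runtime", ascending=False)


# Toy data with placeholder titles and runtimes, invented for this example only
toy = pd.DataFrame({
    "title": ["Long Film A", "Long Film B", "Typical Film"],
    "runtime": [250.0, 335.0, 110.0],
})

long_movies = long_runtime_rows(toy)
print(long_movies)
```

Checking the titles by hand, as done above, is what tells a genuine long film apart from a data entry error.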

 

 

Preprocessing

Missing value processing

genres, runtime, spoken_languages, production_companies, production_countries, and Keywords are filled with the mode.

tagline, crew, cast, and overview are filled with 0.

production_companies_count is filled with 1.

I dropped the 'belongs_to_collection', 'homepage', and 'status' columns.

poster_path is backfilled (fillna).
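The rules above could be sketched in pandas as follows. The column groups are taken from the list above; the function name `fill_missing`, the constant names, and the toy data are assumptions, and this is not necessarily the exact code used for the project.

```python
import numpy as np
import pandas as pd

MODE_COLS = ["genres", "runtime", "spoken_languages",
             "production_companies", "production_countries", "Keywords"]
ZERO_COLS = ["tagline", "crew", "cast", "overview"]
DROP_COLS = ["belongs_to_collection", "homepage", "status"]


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the fill and drop rules to whichever of the listed columns are present."""
    out = df.drop(columns=[c for c in DROP_COLS if c in df.columns])
    for col in MODE_COLS:
        if col in out.columns:
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])  # most frequent value
    for col in ZERO_COLS:
        if col in out.columns:
            # Cast to object so the integer 0 can sit next to strings
            out[col] = out[col].astype(object).fillna(0)
    if "production_companies_count" in out.columns:
        out["production_companies_count"] = out["production_companies_count"].fillna(1)
    if "poster_path" in out.columns:
        out["poster_path"] = out["poster_path"].bfill()  # take the next valid value
    return out


# Toy data, invented for this example only
toy = pd.DataFrame({
    "genres": ["Drama", None, "Drama"],
    "runtime": [100.0, np.nan, 100.0],
    "tagline": ["A tagline", None, None],
    "production_companies_count": [2.0, np.nan, 3.0],
    "poster_path": [None, "/b.jpg", "/c.jpg"],
    "homepage": [None, None, "http://example.com"],
})

clean = fill_missing(toy)
print(clean)
```

Note that a backfill leaves a missing value in place when it is in the last row, since there is no later value to copy from.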

 

Thank you for reading!