This project applies a Bernoulli Naive Bayes classifier to prediction on a high-dimensional dataset.
The algorithm trains a Naive Bayes classifier on existing data and uses it to predict unknown data. The data consists of four files: train.txt, user.txt, movie.txt, and test.txt. user.txt contains users’ attributes such as age, gender, and occupation. movie.txt contains movies’ attributes, including year and genres (one movie can have multiple genres). Any attribute may be N/A, which means the value is unknown or missing. train.txt is the training data for the classifier. Each record contains a user id, a movie id, and the rating that user gave to that movie. test.txt is the data to predict on and consists only of user id and movie id.
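As a rough illustration of the file layout described above, the sketch below loads small in-memory stand-ins for the files with pandas. The column names, the comma delimiter, and the pipe-separated genre format are assumptions made for the example; the real files may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical file contents: column names, delimiter and values are assumed.
USER_TXT = """user_id,gender,age,occupation
1,F,25,10
2,M,N/A,N/A
"""
MOVIE_TXT = """movie_id,year,genres
10,1995,Comedy|Romance
11,N/A,N/A
"""
TRAIN_TXT = """user_id,movie_id,rating
1,10,4
2,11,2
"""
TEST_TXT = """user_id,movie_id
1,11
2,10
"""


def load(text):
    # pandas already treats the string "N/A" as missing by default;
    # listing it explicitly just documents the intent.
    return pd.read_csv(io.StringIO(text), na_values=["N/A"])


users = load(USER_TXT)
movies = load(MOVIE_TXT)
train = load(TRAIN_TXT)
test = load(TEST_TXT)

# Count the missing values per attribute.
print(users.isna().sum())
print(movies.isna().sum())
```

With real files, `io.StringIO(text)` would be replaced by the file path, and the `sep` argument adjusted to whatever delimiter the files actually use.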
The algorithm first integrates all the data into a DataFrame by matching the attributes from user.txt and movie.txt to the records in train.txt and test.txt. It then preprocesses the data with one-hot encoding, which converts categorical data into binary data. Finally, the processed data is used to train the Bernoulli Naive Bayes classifier, and the trained model predicts the ratings for the test data.
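The three steps (integrate, one-hot encode, train and predict) might look roughly like the sketch below, written with pandas and scikit-learn on toy in-memory data. The column names, the pipe-separated genre format, the helper names `integrate` and `encode`, and the choice to treat N/A as its own category are all assumptions for illustration, not necessarily what this project does.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Toy stand-ins for user.txt, movie.txt, train.txt and test.txt (assumed layout).
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "gender": ["F", "M", None],
    "age": [25, 35, 25],
    "occupation": ["student", "engineer", "student"],
})
movies = pd.DataFrame({
    "movie_id": [10, 11, 12],
    "year": [1995, 1999, None],
    "genres": ["Comedy|Romance", "Action", None],
})
train = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "movie_id": [10, 11, 11, 10, 12],
    "rating": [5, 2, 4, 5, 3],
})
test = pd.DataFrame({"user_id": [2, 3], "movie_id": [10, 11]})

CATEGORICAL = ["gender", "age", "occupation", "year"]


def integrate(pairs):
    # Step 1: attach user and movie attributes to each (user id, movie id) pair.
    return (pairs.merge(users, on="user_id", how="left")
                 .merge(movies, on="movie_id", how="left"))


def encode(frame):
    # Step 2: one-hot encode. dummy_na=True gives missing values their own column.
    one_hot = pd.get_dummies(frame[CATEGORICAL], columns=CATEGORICAL, dummy_na=True)
    # A movie can have several genres, so split them into one indicator per genre.
    genres = (frame["genres"].fillna("N/A")
              .str.get_dummies(sep="|")
              .add_prefix("genre_"))
    return pd.concat([one_hot, genres], axis=1).astype(int)


train_full = integrate(train)
test_full = integrate(test)

X_train = encode(train_full)
y_train = train_full["rating"]
# Align the test columns with the training columns; unseen categories are dropped
# and categories absent from the test data are filled with 0.
X_test = encode(test_full).reindex(columns=X_train.columns, fill_value=0)

# Step 3: train the Bernoulli Naive Bayes classifier and predict ratings.
model = BernoulliNB(alpha=1.0)
model.fit(X_train, y_train)
predicted = model.predict(X_test)
print(predicted)
```

Reindexing the test matrix against the training columns matters because encoding the two sets separately can otherwise yield different feature columns, which the classifier cannot handle.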