This project had two parts. The first was a Kaggle competition held between groups within the class, using Airbnb data from all markets in the United States. The second part focused on a specific US market and on building a business case for it. We were expected to demonstrate our ability to ask good questions and form hypotheses, build and select models suitable for analyzing market-specific data, interpret the results, and use them to guide an investor's decision-making process.
The Kaggle competition ran until the end of the semester. We were provided with Airbnb training data containing k listings and 66 variables, along with test data and a data dictionary that explained all the variables.
The performance of the models we built was evaluated using the Area Under the Curve (AUC). Submissions were restricted to one per day, and each team's best submission was used for the leaderboard. The emphasis of the competition was on using only variables that we could explain in the model.
Our feature engineering process relied more on domain knowledge than on a purely combinatorial search over variables to squeeze out higher accuracy: we were competing in business analytics, not in math.
For this part we were assigned a specific market; for my team it was Las Vegas. We were expected to start with market research, asking questions and answering them. This part was a mix of business case development and statistical modeling. Drawing on knowledge gained from the model we built for the Kaggle part, we built a Las Vegas-specific model.
In essence, our goal was to ask questions, answer them using a mix of modeling and market research, and repeat the cycle. We developed a business case for an investor who wanted to buy homes and market them as Airbnb rentals. Needless to say, the investor would like to buy homes that achieve a higher booking rate.
The investor would want to price and manage them in a way that achieves a higher booking rate. The clusters have the shape of an upside-down triangle (Appendix B12). From the above analysis, we can be fairly sure that the group at the bottom left of the triangle is unlikely to consist of hotels that keep their price fixed.
Thus, based on the clustering, we successfully identified which hotels likely belong to the fixed-price group and which are likely to be dynamic. Using the group of hotels that follow a dynamic pricing strategy, we then studied their pricing behavior. We filled the censored daily price of each dynamically priced hotel with its most recent day in the same season for which an uncensored daily price exists.
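The season-based fill described above can be sketched with pandas. This is a minimal illustration only; the table layout, column names, and sample values are assumptions, not the project's actual schema.

```python
import pandas as pd

# Hypothetical daily-price table: one row per hotel per day.
# None = censored price (the hotel was booked that day).
df = pd.DataFrame({
    "hotel_id": [1, 1, 1, 1],
    "date": pd.to_datetime(["2017-01-01", "2017-01-02",
                            "2017-01-03", "2017-01-04"]),
    "season": ["winter", "winter", "winter", "winter"],
    "price": [100.0, None, None, 120.0],
})

# Within each (hotel, season), carry the most recent uncensored
# price forward into the censored days.
df = df.sort_values(["hotel_id", "season", "date"])
df["price_filled"] = df.groupby(["hotel_id", "season"])["price"].ffill()
print(df["price_filled"].tolist())  # [100.0, 100.0, 100.0, 120.0]
```

The grouped forward-fill guarantees that a censored day only ever borrows a price from the same hotel and the same season, matching the rule stated above.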
We assume that for hotels that are booked on a given day, the price on that day cannot be considered an overprice. From the distributions of the continuous variables in the Appendix, we found that most variables show little difference between the TRUE and FALSE values of the availability outcome. In this way, the train, evaluation, and test datasets hold the same proportions of low-, normal-, and good-performance listing IDs.
This prevents overfitting to a certain type of listing ID. Decision trees can predict continuous as well as categorical outcomes, and they are fast and flexible to process. After running the model, we obtained the Predictor Importance chart in Appendix A. The next model is a machine learning model whose purpose is to simplify classification.
It can be used for continuous variables but is best suited for categorical variables. Its major assumption is that the features of each class are independent of one another. We used the SPSS Expert Modeler for this model (Appendix A). The next model, introduced by Kass, splits predictors into various groups with an almost even number of observations until an ideal output is obtained.
It is good for our large dataset, as it can be used with nominal, ordinal, and continuous variables. The table of overall accuracy, the predictor importance, and the expanded decision tree are in Appendix A. Unlike CHAID, the next model does not evaluate splits and combinations within categories, which makes it a good model for fast analysis in terms of processing duration. It also does not give extra weight to predictors with more splits, which decreases bias. The accuracy of the model was then calculated, as shown in the matrix in Appendix A.
CRT divides the data into sections that are homogeneous with respect to the dependent variable: we obtain terminal nodes in which the cases are homogeneous on the dependent variable. It uses the Gini index for splitting nodes. CRT is easy to use with large datasets, as it handles outliers well.
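The Gini-index splitting criterion that CRT uses can be shown in a few lines. This is a generic sketch of the standard formula (impurity 1 − Σ p_k², gain = parent impurity minus the size-weighted child impurities), not the project's SPSS implementation; the labels are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity reduction achieved by splitting `parent` into two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# A perfectly pure split of a balanced node removes all impurity.
parent = ["booked"] * 4 + ["available"] * 4
print(round(split_gain(parent, parent[:4], parent[4:]), 2))  # 0.5
```

CRT greedily chooses, at each node, the predictor and cut point that maximize this gain.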
Our final model for the analysis is a neural network. One important criterion in selecting the best model is its predictor importance: whether or not it gives sufficient importance to all the variables. What we noticed during the analysis was that the neural network was the only model with a balanced predictor importance (Appendix A). From the metric table above, we conclude that the neural network has the best overall accuracy. One interesting result is that people appear to set the maximum nights so as to appear at the top of the search results.
This may be a glitch in the search algorithm, which is also reflected in the model we selected. Specifically, demand is high at the beginning and end of the year, while demand at mid-year is relatively lower. Also, the original availability variables are not sufficient for us to effectively predict the booking ability of hotels; instead, we had to derive several new variables from the existing ones.
Since the market price is formed from the prices of hotels in the same area, we estimated the market price based on those hotels. As hotel prices are not constant over the year, neither is the market price. Using K-means clustering, we grouped hotels into several groups based on their location; this is the premise for quantifying the market price. This project does not settle which strategy is better, but rather learns the strategies of leading businesses in the market to find the market price.
The most important finding in our project is the determination of the overprice variable. In a daily competitive environment, a hotel cannot change its service or amenities immediately; therefore, the best course of action is to alter its price according to the market changes we estimated previously.
That means when seasonal demand is low, the two best choices are either lowering the selling price or hibernating the business to reduce costs. In contrast, in the high-demand season, hotels can increase the price within a certain margin while still remaining attractive to the market.
In summary, the model will help listings understand where they stand in the Airbnb market and will suggest a daily pricing strategy to increase their chances of occupancy. Moreover, using the model we can predict how many days each year a listing could sell, so estimating the yearly revenue from that accommodation becomes possible.
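The revenue estimate follows directly from the model's daily booking predictions: expected booked days times the nightly price. The prices and probabilities below are invented for illustration, not outputs of the project's model.

```python
# Hypothetical yearly revenue estimate from daily booking probabilities.
daily_price = 150.0                      # assumed nightly price
p_booked = [0.8] * 120 + [0.5] * 245     # assumed per-day booking probabilities

expected_booked_days = round(sum(p_booked), 1)
expected_revenue = round(expected_booked_days * daily_price, 2)
print(expected_booked_days)  # 218.5
print(expected_revenue)      # 32775.0
```

A listing-specific version would replace the constant price with the suggested daily price and take the probabilities from the availability model.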
A good, stable estimate of revenue opens the opportunity to share the profit with a third party that takes care of customer service and maintenance. The data only shows the real daily price of hotels on the days that are not booked; if we had the sold price for each day, it would empower our model to directly predict the price that hotels should set each day of the year.
The second limitation is our software, which is not sufficient for the scale of data in this project. To process the data in R, we had to raise the memory limit every time we ran our code. In addition, our datasets are limited to the Airbnb fraction of the accommodation market, and to a single area, since we only have datasets for the city of Melbourne.
If we could get our hands on other datasets, such as traditional hotels on other booking platforms for Melbourne as well as other areas, our view of the accommodation business would broaden. The second suggestion addresses the latter limitation: we should use software that lets us share computing resources, such as Google's cloud computing.
The third suggestion is to rebuild the amenities-score model with isotonic regression. We could also use datasets from other platforms and compare the results. CHAID is assumed to be a good model for the data, as it is very large and has a large number of variables; by obtaining an ideal output, it not only captures the relationships between variables but also allocates them into almost even groups.
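The isotonic-regression suggestion above can be sketched with scikit-learn. The "amenities score" data here is invented: the point is only that isotonic regression fits the best non-decreasing curve, smoothing out local violations of monotonicity.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical amenities scores: a noisy but roughly non-decreasing
# relation between amenity count and a quality score.
n_amenities = np.arange(1, 11)
raw_score = np.array([1.0, 2.1, 1.9, 3.0, 3.2, 3.1, 4.0, 4.5, 4.4, 5.0])

# Isotonic regression fits the best monotone (non-decreasing) curve,
# removing violations such as the 2.1 -> 1.9 dip.
iso = IsotonicRegression(increasing=True)
fitted = iso.fit_transform(n_amenities, raw_score)
print(all(fitted[i] <= fitted[i + 1] for i in range(len(fitted) - 1)))  # True
```

A monotone fit matches the intuition that adding amenities should never lower a listing's score.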
CHAID is more efficient with categorical variables than with continuous values. The model-building settings are described below, along with the predictor importance of the 7 inputs used in the model. The target variable is the availability of the room.
Overprice is basically the difference between the hotel price and the market price. A positive overprice indicates a price set higher than what people in that market segment are willing to pay, holding the hotel's location and amenities constant. Positive reviews are always a boon in a market where everything is transparent and customers can freely voice their opinions.
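The overprice definition above can be computed per cluster in a couple of pandas lines. As an assumption for this sketch, the market price is approximated by the median price within each location cluster; the sample table is invented.

```python
import pandas as pd

# Hypothetical listings with their location-cluster labels.
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1],
    "price": [100.0, 120.0, 140.0, 200.0, 260.0],
})

# Market price ~ median price inside each location cluster (assumption);
# overprice = listing price minus that market price.
df["market_price"] = df.groupby("cluster")["price"].transform("median")
df["over_price"] = df["price"] - df["market_price"]
print(df["over_price"].tolist())  # [-20.0, 0.0, 20.0, -30.0, 30.0]
```

A positive value flags a listing priced above its local market, matching the interpretation given above.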
This reduces the processing time and results in a fast analysis; the only trade-off is accuracy compared with the other models. We use this method in the hope of obtaining more manageable trees, as it can simplify our data by predicting only the variables at each node. The property-type variable shows the kind of property rented. It only makes sense to list certain properties based on demographics and market demand: in a student-friendly suburb, shared accommodations are more common, so hosts can make an appropriate decision when renting out the property.
These variables act as important factors in the hosts' decision making. This shows that QUEST predicts all not-available values correctly. As lift is similar and high for both models, we cannot tell which is better based on lift. Thus, a model like a neural network, which can process many inputs at the same time to produce an output, resembles this process better. The neural network settings are as follows: we used the Multilayer Perceptron, the more complex of the available options, since our data is big with many variables.
It allows us to have more than one hidden layer. The stopping rule used is a maximum of 15 minutes of training time per component model. Classification and Regression Trees (CRT) divides the data into sections that are homogeneous with respect to the dependent variable. That means the process finishes when the classification reaches final nodes with 56, observations for a big branch, or fewer than 28, to create new branches.
Also, we set "use misclassification costs" to TRUE, since we would rather predict a booked hotel as available than predict an available hotel as booked. Moreover, these settings cut down the time needed to run the model.
Thus, the neural network is the best model of the two. The middle layer is the hidden layer, and we have only one hidden layer; the top layer is our target variable, availability. The neural network generates a prediction based on the nodes in the input layer, then adjusts the weights to optimize the accuracy and reduce the error for each node at the same time.
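A one-hidden-layer multilayer perceptron of the kind described above can be sketched with scikit-learn. This is a toy stand-in, not the SPSS model: the data is synthetic, and the hidden-layer size and iteration budget are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
# Synthetic stand-in for listing features: two inputs and a binary
# "available" target that depends on their sum.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Multilayer perceptron with a single hidden layer of 8 units;
# training iteratively adjusts the weights to reduce error.
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y) > 0.9)  # True on this easy toy data
```

The weight adjustment happens via backpropagation on each pass, which is the optimization process the paragraph above describes informally.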
Our finding on overprice is the most important one in our project. To obtain the overprice variable, we had to make use of our data all the way from the survival analysis of the censored price observations to the abnormal performance of each hotel.
In our dataset, each listing ID is grouped into a relatable market formed from listings with similar location, performance, and price level. If a listing wants to compete successfully in that market, its price should be adjusted to appeal to Airbnb customers, who tend to value saving money when they use the platform. An interesting application of this research is predicting the number of days in a year that a hotel could be sold, and then estimating the possible revenue for the whole year.
From that, we can plan our working capital and prepare the expenditure for each season. Moreover, it opens the door to cooperating with other people to share the profit on the basis of projected revenue and cost; in other words, we could participate in a wholesale market for Airbnb. The obvious limitation of this project is the censored daily price data of the hotels.
Without this limitation, we could build a model that successfully suggests a price for each hotel at a specific time of year, or even for every day. This project could be highly applicable in the real industry.