# House Price Prediction Using Exploratory Data Analysis and Machine Learning with Feature Selection

## Abstract:

In many real-world applications, it is more realistic to predict a price range than to forecast a single value. When the goal is to identify a range of prices, price prediction becomes a classification problem. The House Price Index is a typical instrument for estimating house price discrepancies. This repeat sale index analyzes the mean price variation in repeat sales or refinancing of the same assets. Since it depends on all transactions, the House Price Index is poor at projecting the price of a single house. To forecast house prices effectively, this study carries out an exploratory data analysis based on linear regression, ridge regression, Lasso regression, and Elastic Net regression, with the aid of machine learning with feature selection. The proposed prediction model for house prices was evaluated on a machine learning housing dataset, which covers 1,460 records and 81 features. A comparison of the predicted and actual prices showed that the model produced acceptable estimates, with a very small error margin relative to the actual values. The comparison shows that our model is satisfactory in predicting house prices.

## 1. Introduction

Real estate development is an important measure for a country to stimulate economic growth in the short term. As the economy improves, people tend to move from rural areas to cities, resulting in a boom of population. Housing demand rises in tandem with population growth. The growth of house prices is in lockstep with the market. In a specific region, the price of homes may spike suddenly with infrastructural development. For example, homeowners in a residential neighborhood tend to raise the selling price of their houses after issues like an impassable road and unstable electricity are resolved. The price increase of residential dwellings is frequently calculated by the House Price Index [1], [2], [3], [4].

Despite its importance, the House Price Index has not been sufficiently explored by researchers in this century [5], [6], [7]. The overall home value is influenced by many factors, including but not limited to physical states, concepts, and locations. Physical perception can detect the size of the property, the number and space of rooms, the availability of the yard, the area of land and structures, and the age of the property [8]. The price of a house is also affected by other physical attributes, such as its size, year of construction, number of bedrooms and bathrooms, and other interior amenities [9]. Concepts allude to the numerous marketing methods used by developers to persuade potential investors. The common concepts include the proximity of the property to hospitals, markets, educational institutions, airports, major roads, etc. The location of a property has a significant impact on its pricing, because the current land price depends largely on the surroundings.

For various stakeholders (e.g., tenants, homeowners, real estate specialists, lawmakers, and urban/regional planning agencies), it is critical to understand the patterns and determinants of house pricing [10]. A computer-based prediction system can assist people in determining whether and when to purchase a home [11], [12], [13], [14], [15]. Residential real estate, the major reservoir of middle-class equity, acts as a source of capital for new businesses. The rising property prices may enhance demand by raising the income of homeowners, but may also encourage debt-financed spending and erode financial resilience.

There are in general two types of price forecasting strategies: the time series strategy to predict market patterns, and the strategy to determine the price of a commodity based on its features. The former strategy aims to clarify the relationship between current and historical rates, and the latter utilizes pricing and linear regression [16], [17], [18]. Following the second type of strategy, this paper carries out an exploratory data analysis based on linear regression, ridge regression, Lasso regression, and Elastic Net regression with feature selection.

## 2. Methodology

To estimate house prices based on the features of a relevant dataset, this study performs an exploratory data analysis based on linear regression, ridge regression, Lasso regression, and Elastic Net regression with feature selection [19], [20], [21]. The relevant data were collected and explored to analyze the dataset and identify the key sections in the dataset. Then, the data were preprocessed to make them suitable for model creation. These are the main processes of our methodology.

The housing dataset from the Machine Learning Repository at kaggle.com was used to create our model. The selected dataset contains 1,460 records, which provide the aggregated data on 81 features for homes in various suburbs. Before developing a regression model, it is essential to carry out exploratory data analysis. This allows researchers to uncover underlying trends in the data, and assists with the selection of appropriate machine learning algorithms. Accordingly, data exploration was performed to comprehend the features present in the dataset as well as their functions.

The selected dataset contains the per-capita crime rate per town, the percentage of residential land allocated for lots, the ratio of non-retail commercial acres by town, the Charles River dummy variable (1 if the tract bounds the river; 0 otherwise), the nitric oxide concentration (parts per 10 million), the typical number of rooms in a house, the percentage of owner-occupied apartments built before 1980 (represented by age), the weighted distances to employment centers, the index of radial highway accessibility, the tax rate (the total value of the property), the pupil-teacher ratio (town-specific), the proportion of dark-skinned students (town-specific), the median of owner-occupied homes, and the percentage of people with a low socioeconomic status.

Since our model adopts supervised learning, the dataset was divided properly into a training set and a test set [22], [23], [24].

The data for model training and testing must be thoroughly scrutinized before modeling. Otherwise, the constructed model would be unable to learn the patterns very quickly. As shown in Figure 1, there was no permissible missing value in the dataset.
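The missing-value check described above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame, not the authors' code; the column names and values are assumptions chosen to resemble the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data (values are made up).
df = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Count missing entries per column and keep only the columns that have any.
missing = df.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Columns that appear in `missing` would need to be imputed, re-coded, or dropped before modeling.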

Next, the numerical values were normalized, and the categorical variables were encoded one at a time. Following data exploration, the most suitable features were selected against the heatmap, and the feature data were preliminarily processed. The training and test sets typically have various properties.

Scaling was performed to ensure that the features are of a reasonably similar size, as the values of individual features are likely on different scales. The scale difference may weaken the performance of our model. Here, the scaling was realized using the `StandardScaler` class of the Python sklearn module.

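The Standard Scaler assumes that the values of each feature are approximately normally distributed, and rescales them to be centered on 0 with a standard deviation of 1. The mean and standard deviation of each feature are determined, and the feature is standardized as follows [25], [26]:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the original value, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation.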
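As a minimal sketch of this scaling step, the snippet below standardizes two hypothetical numeric features with scikit-learn's `StandardScaler` and checks the result against a manual computation. The arrays are invented for illustration; fitting the scaler on the training data only is an assumption about good practice, not something the source states.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. area and room count).
X_train = np.array([[1000.0, 1.0], [1500.0, 2.0], [2000.0, 3.0], [2500.0, 4.0]])
X_test = np.array([[1750.0, 2.5]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training data
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# Manual standardization: subtract the column mean, divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
```

After scaling, each training column has mean 0 and standard deviation 1, so features measured in square feet and in counts contribute on a comparable scale.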

Figure 2 describes the numerical columns and some specific percentiles statistically. Figure 3 visualizes the numerical variables as a box plot.

## 3. Exploratory Data Analysis

Several features have a slightly positive correlation with the target variable, SalePrice. Some of the features exert a substantial impact on the sale prices of their respective classes, while some others do not. The latter are regarded as irrelevant and are discarded during the feature selection process (Figure 4).
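One way to screen features by their correlation with the target, as a heatmap would show visually, is sketched below. The data are synthetic, the `RandomNoise` column is hypothetical, and the 0.5 cutoff is an illustrative assumption rather than a value taken from the study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: one informative feature and one pure-noise feature.
area = rng.normal(1500, 300, n)
noise_col = rng.normal(0, 1, n)
price = 100 * area + rng.normal(0, 5000, n)
df = pd.DataFrame({"GrLivArea": area, "RandomNoise": noise_col, "SalePrice": price})

# Pearson correlation of every feature with the target.
corr = df.corr()["SalePrice"].drop("SalePrice")

# Keep the features whose absolute correlation reaches the (assumed) cutoff.
threshold = 0.5
selected = corr[corr.abs() >= threshold].index.tolist()
print(corr.round(3))
print(selected)
```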

There are also some binary and ordinal class variables. Many variables exhibit a strong rising trend with SalePrice. In some cases, the rise is nearly linear. Newer homes are typically pricier. Houses with newly built garages are more expensive than houses with older garages, which in turn are more expensive than houses without a garage. The age of the house, measured from the year it was built or remodeled to the year it was sold, appears to have a negative relationship with the sale price. As illustrated in Figure 5, it is not necessary to add this as a new feature, because it will be handled by linear regression.

During data preparation, the initial stage is to establish dummy variables for the remaining class variables, and the next stage is to drop one level per variable. The columns of binary features like “Street” and “Utilities” are useless in modeling, because almost all of their values belong to the same class. The two features could be removed later in the feature selection process. Finally, the original class variables are removed, and dummy variables are created for the remaining class variables (Figure 6). The dummy variables can then be concatenated with the main DataFrame (df).
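The dummy-variable step can be sketched with pandas as below. This is an illustration under assumptions: the toy rows are invented, and `drop_first=True` is used here as one common way to perform the level drop.

```python
import pandas as pd

# Hypothetical rows; "Street" has a single class, as in the near-constant case above.
df = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "Street": ["Pave", "Pave", "Pave", "Pave"],
    "GrLivArea": [1710, 1262, 1786, 1717],
})
cat_cols = ["MSZoning", "Street"]

# One dummy column per level, dropping the first level of each variable
# to avoid perfectly collinear columns.
dummies = pd.get_dummies(df[cat_cols], columns=cat_cols, drop_first=True)

# Remove the original class variables and concatenate the dummies with the main frame.
df_model = pd.concat([df.drop(columns=cat_cols), dummies], axis=1)
print(df_model.columns.tolist())
```

Because “Street” has only one level in this toy frame, it produces no dummy column at all after the level drop, which mirrors why such near-constant features add nothing to the model.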

In machine learning, model training is essentially the estimation of a set of parameters that can describe a particular dataset. The training procedure for the model y = w + bx gives two estimates, one for each of the parameters w and b. Using fewer training data will increase the estimation variance of each parameter. Likewise, using fewer test data will make the estimate of model performance more variable. Hence, the dataset was partitioned into two parts by the ratio of 70:30, namely, a training set and a test set. Figure 7 provides an example of the training data.
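A minimal sketch of the 70:30 partition and of estimating w and b for y = w + bx is given below. The data are synthetic with known parameters (w = 3, b = 2), and the `random_state` value is an arbitrary assumption added for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data generated from y = w + b*x with w = 3, b = 2, plus noise.
x = rng.uniform(0, 10, size=(200, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(0, 0.5, size=200)

# 70:30 split into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)

# Fitting yields one estimate for each parameter.
model = LinearRegression().fit(X_train, y_train)
w, b = model.intercept_, model.coef_[0]
print(f"w = {w:.3f}, b = {b:.3f}")
```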

The model is constructed in the following steps: selecting the features and the target, and then generating a function for developing a linear regression model. The resulting model had a Train R2 of 0.94 and a Test R2 of 0.86, a sign of overfitting. This problem could be remedied through careful selection and regularization of features. There also appears to be some nonlinearity in the error terms, which violates the assumption of linear regression that the error terms are independent of each other. As illustrated in Figure 8, the problem could be corrected by transforming the regressor variables or the target.
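The gap between training and test R2 that signals overfitting can be reproduced on synthetic data, as in the sketch below. The setup (40 features, only one of which matters, and 100 rows) is an assumption chosen to exaggerate the effect and does not reproduce the study's figures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Many features relative to the number of rows; only the first one drives y.
X = rng.normal(size=(100, 40))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
gap = r2_train - r2_test  # a large positive gap suggests overfitting
print(f"Train R2 = {r2_train:.3f}, Test R2 = {r2_test:.3f}")
```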

In addition, the transformation of the regression equation yields log(y) = X + a. Using a statistical model with an error of 0.5, the value was obtained as 0.95 for Train R2 and 0.86 for Test R2. As shown in Figure 9, the trend of error terms varied significantly between the two scenarios. With the transformed response, the error terms appear to be scattered randomly.
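The effect of log-transforming the target can be sketched as follows. The data are synthetic and assume prices that grow exponentially with a single feature and carry multiplicative noise; the coefficients are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic prices: exponential in the feature, with multiplicative noise.
X = rng.uniform(0, 3, size=(300, 1))
y = np.exp(11.0 + 0.8 * X[:, 0] + rng.normal(0, 0.1, 300))

# Fit once on the raw target and once on log(y).
raw_model = LinearRegression().fit(X, y)
log_model = LinearRegression().fit(X, np.log(y))

r2_raw = r2_score(y, raw_model.predict(X))
r2_log = r2_score(np.log(y), log_model.predict(X))

# Predictions on the original price scale are recovered by exponentiating.
y_pred = np.exp(log_model.predict(X))
print(f"R2 raw target = {r2_raw:.3f}, R2 log target = {r2_log:.3f}")
```

In this setting the log model fits a relationship that is linear by construction, so its residuals no longer show the curved pattern left by the raw-target fit.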

## 4. Results and Discussion

Data exploration was performed first to better understand the dataset. The columns from the selected data provide a fascinating summary, through recursive feature elimination (RFE), linear regression, ridge regression, Lasso regression, and Elastic Net regression. This summary makes a lot of sense, for both variables are conditional and unconditional. These columns are assumed to be useless for regression tasks, such as trend prediction.

As illustrated in Figure 10, the useless features were eliminated using RFE with selected elements. After the feature elimination, both the training set and the test set were updated.
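Recursive feature elimination can be sketched with scikit-learn's `RFE` as below. The data are synthetic, and keeping two features is an assumption made for the example; the source does not state how many features RFE retained.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300

# Six candidate features; only columns 0 and 1 influence the target.
X = rng.normal(size=(n, 6))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, n)

# RFE repeatedly fits the estimator and removes the weakest feature
# until the requested number of features remains.
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

kept = np.flatnonzero(selector.support_)  # indices of the retained features
X_reduced = selector.transform(X)         # apply the same mask to train and test sets
print(kept)
```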

Following the RFE feature elimination, ridge regression was carried out to obtain a list of alphas. Five folds were fitted for each of the 28 candidates, for a total of 140 fits (Figure 11). Figure 12 depicts the mean training and test scores for various parameter settings considering negative mean absolute error and alpha.
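A cross-validated search of this shape (28 candidates, five folds, negative mean absolute error) can be sketched with `GridSearchCV`. The alpha values and the synthetic data below are assumptions; the source gives only the number of candidates.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.5, 150)

# 28 candidate alphas (assumed values on a log scale).
alphas = np.logspace(-4, 3, 28)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": alphas},
    scoring="neg_mean_absolute_error",
    cv=5,
    return_train_score=True,  # needed to plot mean train scores against alpha
)
search.fit(X, y)

# 28 candidates x 5 folds = 140 fits.
n_fits = len(search.cv_results_["params"]) * search.n_splits_
print(search.best_params_, n_fits)
```

The `mean_train_score` and `mean_test_score` entries of `search.cv_results_` are the quantities one would plot against alpha.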

There are still a lot more variables to consider. The goal of this study is to develop a prediction model that can anticipate the price of a house in a certain setting based on a set of parameters. The model will also be used to assess the intensity of correlations between the response and the predictors, which is an important goal of house price prediction models. As a result, an intelligent method must be created for feature reduction, without sacrificing model performance.

Lasso generally performs well if only a few of the predictors are used to build a model, and if these predictors have a significant influence on the response variable. Thus, Lasso regression can work as a feature selection method to eliminate unimportant variables. As shown in Figure 13, Lasso regression consists of 5-fold fittings for each of 29 candidates, accounting for a total of 145 fits.

Figure 14 depicts the mean training and test scores obtained through Lasso regression for various parameter settings, in terms of negative mean absolute error and alpha.

Lasso outperforms ridge regression when it comes to the prediction on unseen data. The regression losses of the two methods are approximately identical. Instead of using the best alpha, the salient features can be selected by slightly increasing the alpha. The above results show that Lasso functions as a feature selector capable of reducing unnecessary variables.
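How increasing alpha turns Lasso into a feature selector can be sketched as below. The data and the two alpha values are illustrative assumptions; `selected_features` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Eight candidate features; only columns 0 and 3 influence the target.
X = rng.normal(size=(300, 8))
y = 4.0 * X[:, 0] + 2.0 * X[:, 3] + rng.normal(0, 0.1, 300)


def selected_features(alpha):
    """Hypothetical helper: indices of features with non-zero Lasso coefficients."""
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    return np.flatnonzero(model.coef_)


weak_penalty = selected_features(0.001)   # small alpha: little shrinkage
strong_penalty = selected_features(0.5)   # larger alpha: weak coefficients driven to zero
print(weak_penalty, strong_penalty)
```

A larger alpha shrinks the surviving coefficients as well, so the choice trades some predictive accuracy for a smaller variable set.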

Elastic Net, as a regularized regression method, integrates the L1 and L2 penalties of Lasso and ridge regressions linearly. Figure 15 shows the Elastic Net fitting of five folds for each of the 252 candidates, accounting for a total of 1,260 fits. As shown in Figure 16, the Train R2 was 0.9187, the Train Mean Absolute Error was 0.0796, and the Train Mean Squared Error was 0.1105, whereas the Test R2 was 0.9104, the Test Mean Absolute Error was 0.0796, and the Test Mean Squared Error was 0.1078.
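A search of this size can be sketched as follows. The grid of 28 alphas by 9 L1 ratios is an assumption that happens to give 252 candidates; the source does not state how the candidates were composed, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.3, 120)

# Assumed grid: 28 alphas x 9 L1 ratios = 252 candidates.
param_grid = {
    "alpha": np.logspace(-4, 1, 28),       # overall penalty strength
    "l1_ratio": np.linspace(0.1, 0.9, 9),  # mix between L1 (Lasso) and L2 (ridge)
}

search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
n_fits = n_candidates * search.n_splits_  # 252 candidates x 5 folds
print(search.best_params_, n_fits)
```

In scikit-learn's parameterization, `l1_ratio` close to 1 behaves like Lasso and `l1_ratio` close to 0 behaves like ridge regression.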

Figure 17 shows the prediction effect of our final model. It can be seen that the cost of a place is determined by the above grade living area, overall quality of home (i.e., quality of material finish), age of property, overall condition of the house, basement size, and zoning classification of the sale [27], [28], [29].

## 5. Conclusions

Through the above experimental analysis, it is possible to measure and compare the performance of various models. For ridge regression, the Train R2 is 0.9195, the Train Mean Absolute Error is 0.08, and the Train Mean Squared Error is 0.11; the Test R2 is 0.9059, the Test Mean Absolute Error is 0.08, and the Test Mean Squared Error is 0.11. For Lasso regression, the Train R2 is 0.9122, the Train Mean Absolute Error is 0.0821, and the Train Mean Squared Error is 0.1148; the Test R2 is 0.9113, the Test Mean Absolute Error is 0.0786, and the Test Mean Squared Error is 0.1073. For Elastic Net, the Train R2 is 0.9177, the Train Mean Absolute Error is 0.0803, and the Train Mean Squared Error is 0.1112; the Test R2 is 0.9084, the Test Mean Absolute Error is 0.0802, and the Test Mean Squared Error is 0.109. To sum up, Lasso outperformed the other models in terms of the performance on the test set. Furthermore, Lasso was used for intelligent feature selection with a modified alpha, which managed to reduce the variable set to the 13 most significant variables. According to our final model, the variables that affect the price of a house exhibit a nonlinear relationship between the regressors and the response.

Fadhil Muhammad Basysyar conceived of the presented idea, developed the theory, performed the computations, and verified the analytical methods. Gifthera Dwilestari encouraged Fadhil Muhammad Basysyar to investigate real-world applications of house price prediction and supervised the findings of this work. All authors discussed the results and contributed to the final manuscript.

Fadhil Muhammad Basysyar and Gifthera Dwilestari carried out the experiment, developed the theoretical formalism, performed the data analysis and machine learning with feature selection, and wrote the manuscript. Both authors contributed to the design and implementation of the research, to the analysis of the results, and to the writing of the manuscript.

The data that support the findings of this study are available on Kaggle at www.kaggle.com. These data were derived from the following resources available in the public domain: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data.

We would like to thank STMIK IKMI Cirebon, and International Information and Engineering Technology Association for the support and help with statistical analysis. We sincerely thank all the participants in our study.

The authors declare that they have no conflicts of interest.

*Acadlore Trans. Mach. Learn.*, vol. 1, no. 1, pp. 11-21, 2022. https://doi.org/10.56578/ataiml010103