Rizki Mayandi Hasibuan

Paper 3

Introduction

The current dataset concerns the prices of personal computers from 1993 to 1995. It has 10 columns and 6259 rows. It is taken from Stengos, T. and E. Zacharias (2005), “Intertemporal pricing and price discrimination: a semiparametric hedonic analysis of the personal computer market”, Journal of Applied Econometrics (https://vincentarelbundock.github.io/Rdatasets/datasets.html). Its purpose is to investigate the prices of computers in relation to certain features.

This dataset needs no cleaning beyond a check for outliers. Data cleaning covers several aspects, such as handling missing values and empty columns and rows, and identifying outliers and duplicated values. Any dataset may contain outliers. Graphs give an overview of the data and can hint at outliers, but identifying outliers from graphs alone is not very accurate; statistical tests are more reliable. We use the Rosner test to find outliers in the target variable “price”.

Data Analysis

The purpose of the analysis is to investigate how the price of a computer changes with features such as screen size, RAM size, hard disk size, etc. As noted in the introduction, the data come from the Stengos and Zacharias study in the Journal of Applied Econometrics. The dataset includes both text and numerical data, so it combines qualitative and quantitative variables. The main variable is “price”, which makes it possible to compare computers by cost. With 6259 rows of data, it is common to find some outliers. We use two methods to find them: the first is graphical and the second is a statistical test.

1. Graphs

The histogram, boxplot, and QQ-plot are three plots that give an overview of the data and show its spread. As the histogram shows, the distribution of prices does not look like a normal distribution.
This departure from normality suggests that outliers may be present, which is also visible in the QQ-plot and the boxplot. The boxplot shows the minimum, first quartile, median, third quartile, and maximum. Some extreme values appear at the top of the boxplot. The QQ-plot compares the distribution of the values with the normal distribution; some points lie apart from the rest and could be outliers. Overall, outliers are probably present in this data, but for more confidence we should perform a statistical test.

2. Statistical test

There are three kinds of outlier tests, and the Rosner test is one of the best. The result is as follows:

Number of outliers   Rows
6                    1507, 1992, 2097, 1441, 1701, 2469

We remove these rows to clean the data.

Identifying the types of the variables in a dataset is very important. There are 11 variables in the dataset: three are categorical and eight are numerical. Note that some numerical variables must be classified carefully. For example, “ram” is stored as a number but takes only a few distinct values, so it is better treated as a categorical variable and its mean is not meaningful.

As a first step, we compute descriptive statistics for two numerical variables, “price” and “speed”:

variable   minimum   1st quartile   median   mean   3rd quartile   maximum   sd
price      949       1794           2144     2217   2595           4694      573.81
speed      25        33             50       52     66             100       21.163

The mean price of the computers is 2217, which gives a good overview of the data. The maximum price is 4694, well above the mean (a difference of 2477), and the minimum price of 949 is likewise well below the mean (a difference of 1268). The median price is 2144, which means that half of the prices are higher than 2144 and half are lower; the median is the statistic that splits the data into two halves. The standard deviation is 573.81, which means that prices typically lie about 573.81 units from the mean.
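The columns of the descriptive statistics table above (minimum, quartiles, median, mean, maximum, sd) can be computed along the following lines. This is only a minimal sketch using Python's standard library, not the code used for this paper. The function name describe and the five prices are invented for illustration, and the "inclusive" quartile rule is one of several conventions, so other software may give slightly different quartiles on real data.

```python
import statistics


def describe(values):
    """Return min, quartiles, mean, max and sample sd for a list of numbers."""
    data = sorted(values)
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "minimum": data[0],
        "q1": q1,
        "median": median,
        "mean": statistics.mean(data),
        "q3": q3,
        "maximum": data[-1],
        "sd": statistics.stdev(data),  # sample standard deviation (n - 1)
    }


# Hypothetical prices, invented for illustration (not taken from the dataset).
prices = [1000, 1500, 2000, 2500, 3000]
summary = describe(prices)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Comparing the mean with the median, and the quartiles with the minimum and maximum, gives the same kind of reading as in the paragraph above.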
The mean speed of the computers is 52. The maximum speed is 100, well above the mean (a difference of 48), and the minimum speed of 25 is likewise below the mean (a difference of 27). The median speed is 50, which means that half of the values are higher than 50 and half are lower. The standard deviation is 21.163, which means that speeds typically lie about 21.163 units from the mean.

A scatter plot is a graph that shows the joint dispersion of two numerical variables. The scatter plot of “price” against “speed” shows that the spread of prices is similar across the different speed values.

Frequency tables are used to give an overview of categorical variables.

ram (MB)    2     4      8      16    24    32
frequency   394   2236   2318   994   297   13

The distribution of “ram” shows that 2318 computers have 8 MB of RAM, the most common size. The size of the RAM is related to price and speed. Only 13 computers have 32 MB, which suggests that few people use computers with that much RAM, probably because the price goes up as the RAM goes up.

screen (inch)   14     15     17
frequency       3661   1991   600

The number of computers with a 14-inch screen is 3661, so this size is very common. The larger the screen, the lower the frequency: only 600 computers have a 17-inch screen. Again, the size of the screen is an important factor in the price.

CD          Yes    No
frequency   2902   3350

A CD is an optional feature of a computer. In this dataset, 2902 computers have one and 3350 do not.

Subsetting

1. We can subset the data to prices higher than 2000. The descriptive statistics for this subset are shown in the table below:

variable   minimum   1st quartile   median   mean   3rd quartile   maximum
price      2004      2249           2499     2597   2845           4694

When some values are omitted, the dataset gets smaller and the descriptive statistics describe only part of the sample.

2. Another subset can be taken on speeds lower than 40.
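The two subsets described in this section can be sketched as simple filters. This is a minimal illustration in plain Python rather than the code used for the paper; the four records and the variable names expensive and slow are invented, and only the two thresholds (price above 2000, speed below 40) come from the text.

```python
import statistics

# Hypothetical records, invented for illustration (not rows from the dataset).
computers = [
    {"price": 1500, "speed": 25},
    {"price": 2100, "speed": 33},
    {"price": 2600, "speed": 50},
    {"price": 3200, "speed": 66},
]

# Subset 1: computers priced above 2000.
expensive = [row for row in computers if row["price"] > 2000]

# Subset 2: computers with a speed below 40.
slow = [row for row in computers if row["speed"] < 40]

# Statistics computed on a subset describe only that part of the sample.
mean_price_expensive = statistics.mean(row["price"] for row in expensive)
mean_speed_slow = statistics.mean(row["speed"] for row in slow)

print(len(expensive), mean_price_expensive)
print(len(slow), mean_speed_slow)
```

Because each filter keeps only rows on one side of a threshold, the minimum or maximum of the subset is bounded by that threshold, which is why the subset tables look different from the table for the full sample.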
The descriptive statistics for the subset of speeds lower than 40 are shown in the table below:

variable   minimum   1st quartile   median   mean    3rd quartile   maximum
speed      25        33             33       31.26   33             33

Again, when some values are omitted, the dataset gets smaller and the descriptive statistics describe only part of the sample.

The bar chart of “ram” shows that 8 MB has a higher frequency than the other sizes. The bar chart of screen size shows that 14 inches has a higher frequency than the other sizes. The bar chart of “cd” shows that “no” is more frequent than “yes”.

In this step, we examine the effect of having a CD on the price of the computers. The independent two-sample t-test is the suitable test.

Null hypothesis: having a CD has no effect on the price of the computer.
Alternative hypothesis: having a CD has an effect on the price of the computer.

T statistic   p-value   Mean price (Yes)   Mean price (No)
-15.997       < 0.001   -                  2111.95

Result: Because the p-value is less than 0.05, we reject the null hypothesis. This means that having a CD has a statistically significant effect on the price of the computers. The table above shows the difference in price between the two groups.

Summary

In general, we saw that the price of a computer is determined by its features. Each feature has its own effect on price, and there are certainly other factors that may also matter. Prices differ across the levels of the categorical variables. The dataset gives us a good deal of information about price differences; for example, the bigger the screen of a computer, the higher the price. We also see that prices fluctuate. We answered several questions with this dataset: What are the main reasons for high computer prices? Does having a CD have an effect on prices? What is the frequency of each level of the categorical variables? Further information could be obtained by investigating the relationships between price and speed and between price and ram.

References

1. https://stats.oarc.ucla.edu/r/faq/frequently-asked-questions-about-rhow-can-i-subset-a-data-setthe-r-program-as-a-text-file-for-all-the-code-on-this-page-subsettingis-a-very-important-component/
2. https://www.projectpro.io/recipes/is-table-function-ruseful#:~:text=Table%20function%20(table())in,Frequency%20table
3. https://universeofdatascience.com/how-to-test-for-identifying-outliers-in-r/
