Paper 3
Introduction
The dataset describes the prices of personal computers from 1993 to 1995. It has 10
columns and 6259 rows. It is taken from Stengos, T. and E. Zacharias (2005),
“Intertemporal pricing and price discrimination: a semiparametric hedonic analysis of the
personal computer market”, Journal of Applied Econometrics
(https://vincentarelbundock.github.io/Rdatasets/datasets.html). Its purpose is to
investigate how the price of a computer depends on its features. The dataset needs no
cleaning other than a check for outliers. Cleaning data has several aspects, such as
handling missing values, empty columns and rows, and identifying outliers and duplicated
values. Any dataset may contain outliers. Graphs give an overview of the data and of
possible outliers, but identifying outliers from graphs alone is not very accurate;
statistical tests are more reliable. We use the Rosner test to find outliers in the target
variable “price”.
Data Analysis
The purpose of the dataset is to investigate how the prices of computers change with
features such as screen size, RAM size, and hard disk size. The data come from the
Journal of Applied Econometrics article “Intertemporal pricing and price discrimination”
cited in the Introduction. The dataset contains both text and numerical data: some
variables are qualitative and some are quantitative. The main variable is “price”, which
is the measure used to compare computers. There are 6259 rows of data, so some outliers
are to be expected. We use two methods to find them: the first is graphical and the
second is a statistical test.
1. Graphs
The histogram, boxplot, and QQ-plot are three plots that give an overview of the data
and its spread.
The histogram does not have the shape of a normal distribution, which suggests that some
outliers may be present. The QQ-plot and boxplot point the same way. The boxplot shows
the minimum, 1st quartile, median, 3rd quartile, and maximum, and some extreme values
appear above its top. The QQ-plot compares the quantiles of the data with those of a
normal distribution, and some points lie apart from the rest and could be outliers.
Overall, outliers are likely in this data, but to be more confident we should perform a
statistical test.
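The extreme values that a boxplot draws as separate points are usually found with the 1.5 × IQR rule. The following is a minimal Python sketch of that rule, not the code used for the plots in this paper. The price values and the function name `boxplot_outliers` are hypothetical, and the quartile method (`inclusive`) is an assumption.

```python
import statistics

# Hypothetical prices, used only to illustrate the rule (not the real dataset).
prices = [1500, 1800, 2000, 2100, 2200, 2300, 2400, 2600, 2800, 5400]


def boxplot_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # Quartiles by linear interpolation between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside the fences are the points a boxplot draws individually.
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


lower, upper, outliers = boxplot_outliers(prices)
print(f"fences: [{lower}, {upper}]  flagged: {outliers}")
```

A value flagged this way is only a candidate outlier, which is why a formal test is still needed.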
2. Statistical test
Several statistical tests for outliers are available. The Rosner test is a good choice
here because it can detect several outliers at once. The result is as follows:

Number of outliers    Rows
6                     1507, 1992, 2097, 1441, 1701, 2469

We remove these rows from the dataset.
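The Rosner test is also known as the generalized extreme Studentized deviate (ESD) test. The sketch below shows how it can be computed in plain Python. It is an illustration under stated assumptions, not the code that produced the result above: the sample values are hypothetical, the helper names (`t_cdf`, `t_ppf`, `rosner_test`) are invented for this example, and the t quantile is obtained by simple numerical integration and bisection rather than from a statistics library.

```python
import math
import statistics


def t_cdf(t, df, steps=2000):
    """CDF of Student's t for t >= 0, by Simpson integration of the density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return 0.5 + total * h / 3


def t_ppf(p, df):
    """Quantile of Student's t for p > 0.5, by bisection on t_cdf."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:      # widen the bracket until it contains p
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def rosner_test(values, max_outliers=3, alpha=0.05):
    """Generalized ESD test: return the indices of the detected outliers."""
    n = len(values)
    remaining = list(range(n))    # indices still in the sample
    removed = []                  # most extreme index at each step
    n_outliers = 0
    for i in range(1, max_outliers + 1):
        sample = [values[j] for j in remaining]
        mean = statistics.fmean(sample)
        sd = statistics.stdev(sample)
        worst = max(remaining, key=lambda j: abs(values[j] - mean))
        r_i = abs(values[worst] - mean) / sd           # test statistic R_i
        p = 1 - alpha / (2 * (n - i + 1))
        t = t_ppf(p, n - i - 1)
        lam = (n - i) * t / math.sqrt((n - i - 1 + t * t) * (n - i + 1))
        removed.append(worst)
        if r_i > lam:             # the largest i with R_i > lambda_i wins
            n_outliers = i
        remaining.remove(worst)
    return removed[:n_outliers]


# Hypothetical prices with two obviously extreme values at the end.
prices = [2020, 1960, 2060, 1980, 2000, 2040, 1940, 2080, 1920, 2000,
          2020, 1980, 5000, 6000]
print("outlier rows (0-based):", rosner_test(prices))
```

The number of outliers is the largest step at which the statistic exceeds its critical value, so an extreme value that is masked by an even more extreme one is still detected.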
Identifying the types of variables is an important step in working with a dataset.
There are 11 variables in the dataset: three are categorical and eight are numerical. Note
that some variables stored as numbers need care. For example, “ram” is not really a
numerical variable: it takes only a few distinct values, so its mean is not meaningful.
As a first step, we compute descriptive statistics for the two numerical variables “price”
and “speed”. The table of descriptive statistics is shown below:
variable   minimum   1st quartile   median   mean   3rd quartile   maximum   sd
price      949       1794           2144     2217   2595           4694      573.81
speed      25        33             50       52     66             100       21.163
The mean price of the computers is 2217, which gives a good overview of the data. The
maximum price is 4694, far above the mean, and the minimum, 949, is well below it. The
median price is 2144, which means that about half of the prices are higher than 2144 and
about half are lower; the median is the statistic that splits the data into two halves.
The standard deviation is 573.81, which means that prices typically lie about 573.81
units from the mean.
The mean speed of the computers is 52. The maximum speed is 100, far above the mean, and
the minimum, 25, is well below it. The median speed is 50, so about half of the values
are higher than 50 and about half are lower. The standard deviation is 21.163, which
means that speeds typically lie about 21.163 units from the mean.
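The seven statistics in the table can be computed with Python's standard library, as in the minimal sketch below. The price values and the function name `describe` are hypothetical, and the quartile method (`inclusive`, linear interpolation between data points) is an assumption, so results on the real data could differ slightly from the table.

```python
import statistics


def describe(values):
    """Return the descriptive statistics used in the table above."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "minimum": min(values),
        "1st quartile": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "3rd quartile": q3,
        "maximum": max(values),
        "sd": statistics.stdev(values),   # sample standard deviation (n - 1)
    }


# Hypothetical prices, not the real dataset.
prices = [1500, 1800, 2000, 2200, 2500]
summary = describe(prices)
for name, value in summary.items():
    print(f"{name:>12}: {value:.2f}")
```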
A scatter plot is a graph that shows how two numerical variables vary together. It is
drawn here for “speed” and “price”.
The plot shows that the spread of prices is similar across the different speed values.
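What a scatter plot shows visually can be summarized with the Pearson correlation coefficient. The sketch below is illustrative only: the speed and price values are hypothetical, the function name `pearson_r` is invented for this example, and no correlation was reported in the analysis above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values, not the real dataset.
speed = [25, 33, 50, 66, 100]
price = [1500, 1800, 2000, 2200, 2500]
r = pearson_r(speed, price)
print(f"r = {r:.3f}")
```

A value near 0 would match a plot in which price varies little with speed, while a value near 1 or -1 would indicate a strong linear relationship.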
Frequency tables are used to summarize categorical variables.

ram (MB)     2     4      8      16    24    32
frequency    394   2236   2318   994   297   13

The distribution of “ram” shows that 2318 computers have 8 MB of RAM, the most common
size. RAM size is related to both price and speed. Only 13 computers have 32 MB, which
suggests that few people buy computers with that much RAM, probably because the price
rises with the amount of RAM.
Screen (inch)    14     15     17
Frequency        3661   1991   600

The number of computers with 14-inch screens is 3661, so this size is by far the most
common. As the screen size goes up, the frequency falls: only 600 computers have 17-inch
screens. Screen size is also an important factor in the price.
CD           Yes    No
Frequency    2902   3350

A CD-ROM drive is an optional feature of a computer. In this dataset, 2902 computers have
one and 3350 do not.
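A frequency table simply counts how often each level of a categorical variable occurs. The minimal Python sketch below uses `collections.Counter` for this. The sample values and the function name `frequency_table` are hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Count each distinct value and return the counts sorted by value."""
    counts = Counter(values)
    return dict(sorted(counts.items()))


# Hypothetical observations, not the real dataset.
screen = [14, 14, 15, 17, 14]
cd = ["yes", "no", "no", "yes", "no"]

print(frequency_table(screen))
print(frequency_table(cd))
```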
Subsetting
1. We can subset the data to prices higher than 2000. The descriptive statistics of the
subset are given in the table below:

variable   minimum   1st quartile   median   mean   3rd quartile   maximum
price      2004      2249           2499     2597   2845           4694

When some values are omitted, the dataset gets smaller and the descriptive statistics
describe only that part of the sample.
2. Another subset can be formed from the speeds lower than 40. The descriptive
statistics of the subset are given in the table below:

variable   minimum   1st quartile   median   mean    3rd quartile   maximum
speed      25        33             33       31.26   33             33

Again, the statistics describe only the selected part of the sample.
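Both subsets are formed by keeping only the rows that satisfy a condition. The minimal Python sketch below shows the idea with list comprehensions. The rows are hypothetical, and the variable names `computers`, `expensive`, and `slow` are invented for this example.

```python
# Hypothetical rows, not the real dataset.
computers = [
    {"price": 1500, "speed": 25},
    {"price": 1800, "speed": 33},
    {"price": 2100, "speed": 33},
    {"price": 2500, "speed": 66},
    {"price": 3000, "speed": 100},
]

# Subset 1: computers priced above 2000.
expensive = [row for row in computers if row["price"] > 2000]

# Subset 2: computers with a speed below 40.
slow = [row for row in computers if row["speed"] < 40]

print(len(expensive), "rows with price > 2000")
print(len(slow), "rows with speed < 40")
```

Descriptive statistics computed on `expensive` or `slow` then describe only those rows, which is why the minimum price in the first subset cannot fall below the threshold.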
The bar chart of “ram” shows that 8 MB is more frequent than the other sizes.
The bar chart of “screen” shows that 14 inches is more frequent than the other sizes.
The bar chart of “cd” shows that “no” is more frequent than “yes”.
In this step, we examine the effect of having a CD-ROM drive (“cd”) on the price of
computers. An independent two-sample t-test is the suitable test.
Null hypothesis: having a CD has no effect on the mean price of a computer.
Alternative hypothesis: having a CD has an effect on the mean price of a computer.

T statistic   p-value   Mean price (Yes)   Mean price (No)
-15.997       < 0.001   -                  2111.95

Result: Because the p-value is less than 0.05, we reject the null hypothesis. This means
that having a CD has a statistically significant effect on the price of a computer.
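The t statistic of an independent two-sample test can be computed directly from the two group means and variances. The sketch below uses Welch's version, which does not assume equal variances. This is an assumption, since the paper does not say which variant was used. The group values and the function name `welch_t` are hypothetical, and the sketch stops at the statistic and its degrees of freedom; the p-value would come from the t distribution with those degrees of freedom.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test of mean(a) - mean(b)."""
    # Squared standard errors of the two group means.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical prices for the two groups, not the real dataset.
price_no_cd = [1, 2, 3, 4, 5]
price_cd = [2, 4, 6, 8, 10]
t, df = welch_t(price_no_cd, price_cd)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The sign of t depends only on the order of the groups: it is negative when the first group has the lower mean.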
Summary
In general, we saw that the price of a computer is determined by several features, each
with its own effect. There are certainly other factors that may also matter. Prices differ
across the levels of the categorical variables, so the dataset gives us a good deal of
information about price differences. For example, the larger the screen of a computer, the
higher its price. We also see that prices fluctuate. With this dataset we addressed
several questions: What are the main reasons for high computer prices? Does having a CD
affect the price? What are the frequencies of the categorical variables? Further work
could investigate the relationships between price and speed and between price and RAM.
References
1. https://stats.oarc.ucla.edu/r/faq/frequently-asked-questions-about-rhow-can-i-subset-a-data-setthe-r-program-as-a-text-file-for-all-the-code-on-this-page-subsettingis-a-very-important-component/
2. https://www.projectpro.io/recipes/is-table-function-ruseful#:~:text=Table%20function%20(table())in,Frequency%20table
3. https://universeofdatascience.com/how-to-test-for-identifying-outliers-in-r/