Sunday, May 31, 2015

Sample Median Calculation

Part of Mike's Big Data, Data Mining, and Analytics Tutorial  

The median is a measure of center that splits a distribution of values into two equal parts, each representing 50% of the sample. The notation I will use for medians (and percentiles) is \( x_p \), where \( x \) is a sample value and \( p \) is the proportion of the sample that is less than \( x \). The median is therefore denoted \( x_{.5} \) or \( x_{50\%} \). When we calculate the median, we are looking for the “middle” value rather than the “balancing” value (the mean), though for symmetrically distributed data the mean and median are close or identical.
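The difference between the “middle” and the “balancing” value is easiest to see on a small example. The following is a minimal sketch; the two samples `x_sym` and `x_skew` are made up for illustration and are not part of the tutorial data.

```r
# Hypothetical samples chosen only for illustration
x_sym<-c(1,2,3,4,5)     # symmetric around 3
x_skew<-c(1,2,3,4,50)   # one large value pulls the mean upward

mean(x_sym)
## [1] 3
median(x_sym)
## [1] 3
mean(x_skew)
## [1] 12
median(x_skew)
## [1] 3
```

In the symmetric sample the two measures agree; in the skewed sample the mean moves toward the large value while the median stays at the middle value.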
There are a few guidelines to using the median:
  • The median is a measure of center for data that is measured on at least an ordinal scale (Review data classification here).
  • The median is not appropriate for nominal scale data. Continuous scale data, however, can be treated as ordinal, so statistical procedures based on the median can also be used on it.
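To illustrate the first guideline, here is a rough sketch of picking the middle value of ordinal data. The `responses` vector and its three levels are hypothetical, and the sample size is deliberately odd so that a single middle element exists; the sketch indexes the sorted values directly rather than averaging, since averaging is not meaningful for ordinal labels.

```r
# Hypothetical survey responses on an ordinal scale (odd sample size)
responses<-factor(c("low","high","medium","low","medium"),
                  levels=c("low","medium","high"),ordered=TRUE)
# Sort by level order (low < medium < high) and take the middle element
sorted_responses<-sort(responses)
middle<-as.character(sorted_responses[(length(sorted_responses)+1)/2])
middle
## [1] "medium"
```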
For these examples, we will use the following two samples:
#Get 10 random values drawn uniformly between 1 and 100, rounded to integers
x_even<-round(runif(10,1,100))
#sort and display the values
x_even<-sort(x_even)
x_even
##  [1] 11 17 44 56 69 71 74 75 78 94
#Create an odd-size sample by drawing 9 of the 10 values without replacement
x_odd<-sample(x_even,9)
x_odd<-sort(x_odd)
x_odd
## [1] 11 44 56 69 71 74 75 78 94
The computation of the median depends on whether the sample size is odd or even. For an odd-size sample (the sample size is not evenly divisible by 2), the median is exactly the middle value of the sorted sample.
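The odd and even cases worked through in the following sections can be summarized in one piecewise formula. Here \( x_{(i)} \) denotes the \( i \)th smallest value in a sample of size \( n \); this order-statistic notation is an added shorthand and is not used elsewhere in this post.

\[
x_{.5} =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\
\dfrac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n+2}{2}\right)}}{2} & \text{if } n \text{ is even}
\end{cases}
\]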

Odd-Size Sample Median Calculation

As a reminder, these are the sorted values from the odd-size sample:
x_odd
## [1] 11 44 56 69 71 74 75 78 94
If we have sorted the sample from smallest to largest and it has an odd size, we want the \( \frac{n+1}{2} \)th element. In the case of our odd-size sample, we want the \( \frac{9+1}{2} = 5 \)th element.
Using this definition, we can directly get the median:
x_odd[(length(x_odd)+1)/2]
## [1] 71
#This is the same as
x_odd[5]
## [1] 71
We can also use the median function:
median(x_odd)
## [1] 71
The median value 71 has 4 values below it (11, 44, 56, and 69) and 4 values above it (74, 75, 78, and 94), so the same share of the sample lies below 71 as above it.
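The counts on either side of the median can be checked directly. This is a small sketch that re-enters the odd-size sample shown above so it runs on its own; the variable `m` is introduced here only for illustration.

```r
# The odd-size sample from this post, entered by hand
x_odd<-c(11,44,56,69,71,74,75,78,94)
m<-median(x_odd)
#Number of values strictly below the median
sum(x_odd<m)
## [1] 4
#Number of values strictly above the median
sum(x_odd>m)
## [1] 4
```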

Even-Size Sample Median Calculation

As a reminder, these are the sorted values from the even-size sample:
x_even
##  [1] 11 17 44 56 69 71 74 75 78 94
If our sample has an even size, there is no single “middle” value to use as the median. Instead, we compute a value that has 50% of the sample below it and 50% above it. To do this, we average the middle two values.
If we have sorted the sample from smallest to largest and it has an even size, we want the midpoint between the \( \frac{n}{2} \)th and \( \frac{n+2}{2} \)th elements. For our sample of size 10, that is the midpoint between the \( \frac{10}{2} = 5 \)th and \( \frac{12}{2} = 6 \)th elements, which are 69 and 71.
Using this definition, we can directly get the median:
(x_even[length(x_even)/2] + x_even[(length(x_even)+2)/2])/2
## [1] 70
#This is the same as
(x_even[5]+x_even[6])/2
## [1] 70
We can also use the median function:
median(x_even)
## [1] 70
The median value 70 has 5 values below it (11, 17, 44, 56, and 69) and 5 values above it (71, 74, 75, 78, and 94), so 50% of the sample lies below 70 and 50% lies above it.
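The odd and even rules can be combined into one function. The following is a minimal sketch, assuming a numeric vector with no missing values; `sample_median` is a hypothetical helper name, not a base R function, and in practice the built-in `median` does the same job.

```r
# Hypothetical helper (not part of base R) combining the two rules above
sample_median<-function(x) {
  x<-sort(x)     # order the values from smallest to largest
  n<-length(x)
  if (n %% 2 == 1) {
    x[(n+1)/2]                 # odd size: the middle element
  } else {
    (x[n/2] + x[(n+2)/2])/2    # even size: midpoint of the two middle elements
  }
}

# The two samples from this post, entered by hand
x_even<-c(11,17,44,56,69,71,74,75,78,94)
x_odd<-c(11,44,56,69,71,74,75,78,94)
sample_median(x_odd)
## [1] 71
sample_median(x_even)
## [1] 70
```

Both results match the values returned by `median` earlier in the post.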
