Tyler Marrs (Posts about r)http://tylermarrs.com/enContents © 2020; Tyler MarrsFri, 27 Mar 2020 00:45:00 GMTNikola (getnikola.com)http://blogs.law.harvard.edu/tech/rss- Quick Stats - Descriptive Statistics Part 4 - Application of Mean, Median and Modehttp://tylermarrs.com/posts/quick-stats-descriptive-statistics-part-4-application-of-mean-median-and-mode/Tyler Marrs<div><p></p><h3>
Introduction
</h3>
<p>
In the fourth part of our descriptive statistics series we will look at a real-world application of the mean, median and mode. If you are unfamiliar with these concepts, please read parts 1, 2 and 3 of this blog series.
</p>
<p>
<a href="http://tylermarrs.com/quick-stats-descriptive-statistics-part-1-mean/" target="_blank">Quick Stats – Descriptive Statistics Part 1 – Mean</a><br>
<a href="http://tylermarrs.com/quick-stats-descriptive-statistics-part-2-median/" target="_blank">Quick Stats – Descriptive Statistics Part 2 – Median</a><br>
<a href="http://tylermarrs.com/quick-stats-descriptive-statistics-part-3-mode/" target="_blank">Quick Stats – Descriptive Statistics Part 3 – Mode</a>
</p>
<h3>
Customer Ratings
</h3>
<p>
Customer rating systems are used within ecommerce all of the time. They give customers useful insight into which items they would like to purchase. Imagine visiting Amazon.com. Typically you will try to find items that fall within a particular price range and have good customer reviews.
</p>
<p style="text-align: center;">
<img alt="" class="alignnone size-full wp-image-136" height="215" src="http://tylermarrs.com/wp-content/uploads/2017/01/customer_reviews_sample.png" width="254">
</p>
<p style="text-align: center;">
<em><strong>Image courtesy of Amazon.com</strong></em>
</p>
<p>
The ratings are typically given on a 1 to 5 star scale, where 1 star is the worst rating and 5 stars is the best. These ratings are summarized in a frequency table showing how many ratings were given at each star level. Additionally, we can see the average (mean) star rating. However, we do not typically see the median. Why do you think that is?
</p>
<p>
<img alt="" class="size-full wp-image-129 aligncenter" height="299" src="http://tylermarrs.com/wp-content/uploads/2017/01/customer_reviews.png" style="" title="" width="598">The plot above is a histogram of customer ratings. It shows how many customers gave each rating. The green line is the median and the blue line is the mean. This plot does not represent the sample image above; it uses made-up sample data that I constructed in R (code below).
</p>
<pre class="code literal-block"><span></span><code><span class="n">data</span> <span class="o">&lt;-</span> <span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">66</span><span class="p">)</span>
<span class="n">data</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nf">rep</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">184</span><span class="p">))</span>
<span class="n">data</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nf">rep</span><span class="p">(</span><span class="m">3</span><span class="p">,</span> <span class="m">200</span><span class="p">))</span>
<span class="n">data</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nf">rep</span><span class="p">(</span><span class="m">4</span><span class="p">,</span> <span class="m">201</span><span class="p">))</span>
<span class="n">data</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nf">rep</span><span class="p">(</span><span class="m">5</span><span class="p">,</span> <span class="m">349</span><span class="p">))</span>
<span class="n">m</span> <span class="o">&lt;-</span> <span class="nf">mean</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">md</span> <span class="o">&lt;-</span> <span class="nf">median</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="nf">hist</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">main</span><span class="o">=</span><span class="s">"Customer Ratings"</span><span class="p">,</span> <span class="n">ylab</span><span class="o">=</span><span class="s">"Number of Customers"</span><span class="p">,</span> <span class="n">xlab</span><span class="o">=</span><span class="s">"Rating"</span><span class="p">)</span>
<span class="nf">abline</span><span class="p">(</span><span class="n">v</span> <span class="o">=</span> <span class="n">m</span><span class="p">,</span> <span class="n">col</span> <span class="o">=</span> <span class="s">"blue"</span><span class="p">)</span>
<span class="nf">abline</span><span class="p">(</span><span class="n">v</span> <span class="o">=</span> <span class="n">md</span><span class="p">,</span> <span class="n">col</span> <span class="o">=</span> <span class="s">"green"</span><span class="p">)</span>
</code></pre>
<p>
Knowing that the median is 4 and the mean is 3.583, do you think it is important to know the median when deciding on a product to purchase? Probably not. If you remember our discussion on the mean and median, when to use one versus the other depends heavily on context. Imagine if we only displayed the median. Because 4 is higher than 3.583, you would probably be more interested in purchasing this item.
</p>
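<p>
As a rough cross-check of the numbers above, the same frequency table can be summarized in Python. This is only a sketch: it uses the standard library's statistics module rather than the NumPy and SciPy functions used elsewhere in this series, and the variable names are my own. It also reports the mode, which the plot does not mark.
</p>

```python
from statistics import mean, median, mode

# Rating counts copied from the R snippet above: star value -> number of customers.
rating_counts = {1: 66, 2: 184, 3: 200, 4: 201, 5: 349}

# Expand the frequency table into one entry per customer rating.
ratings = [stars for stars, count in rating_counts.items() for _ in range(count)]

rating_mean = mean(ratings)      # 3583 / 1000 = 3.583
rating_median = median(ratings)  # the 500th and 501st sorted values are both 4
rating_mode = mode(ratings)      # 5 stars is the most common rating (349 times)

print(rating_mean, rating_median, rating_mode)
```

<p>
For this data set the three summaries disagree: the mode is 5, the median is 4 and the mean is 3.583, so which one is displayed changes how attractive the item looks.
</p>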
<h3>
Conclusion
</h3>
<p>
Seeing a real-world application of the mean, median and mode should provide you with more insight into these important descriptive statistics. Customer rating systems are used in ecommerce everywhere and the intuition behind them should be a little clearer. In the next blog post we will look at the standard deviation.
</p>
<p></p></div>descriptive-statisticspythonrstatisticshttp://tylermarrs.com/posts/quick-stats-descriptive-statistics-part-4-application-of-mean-median-and-mode/Tue, 31 Jan 2017 07:37:07 GMT
- Quick Stats - Descriptive Statistics Part 3 - Modehttp://tylermarrs.com/posts/quick-stats-descriptive-statistics-part-3-mode/Tyler Marrs<div><p></p><h3>
Introduction
</h3>
<p>
In this quick post on descriptive statistics we will cover the mode. The mode is the value that occurs most frequently within a given data set. To learn about the <a href="http://tylermarrs.com/quick-stats-descriptive-statistics-part-2-median/" target="_blank">median</a> or <a href="http://tylermarrs.com/quick-stats-descriptive-statistics-part-1-mean/" target="_blank">mean</a>, please click the respective links.
</p>
<h3>
Mode
</h3>
<p>
The mode is one of the easiest descriptive statistics to understand. It simply provides the value(s) that occur(s) most frequently. In most cases it is not very informative to identify a single value that occurs frequently. However, it is useful to find the number of occurrences for many values. To do this we can make use of a histogram to display the data.
</p>
<p>
Let's generate 1,000 numbers in the range of 1 to 20 and visualize them. The code to do this in R follows:
</p>
<pre class="code literal-block"><span></span><code><span class="n">data</span> <span class="o">&lt;-</span> <span class="nf">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">20</span><span class="p">,</span> <span class="m">1000</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="bp">T</span><span class="p">)</span>
<span class="nf">hist</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">main</span><span class="o">=</span><span class="s">"Random Values 1 to 20"</span><span class="p">)</span>
</code></pre>
<p>
In the code sample above we generated the numbers and created a histogram of the data. A histogram, shown below, displays the number of occurrences for each value observed in our data set.
</p>
<p style="text-align: center;">
<img alt="" class="size-full wp-image-125 aligncenter" height="299" src="http://tylermarrs.com/wp-content/uploads/2017/01/frequency_example.png" style="" title="" width="598">
</p>
<h3>
R
</h3>
<p>
Unfortunately, R has no built-in function that returns the statistical mode of a data set (the built-in mode function reports how an object is stored instead). However, we can create a table for the number of occurrences or visualize the data with a histogram.
</p>
<pre class="code literal-block"><span></span><code><span class="n">data</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">3</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">5</span><span class="p">)</span>
<span class="nf">table</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre>
<h3>
Python
</h3>
<p>
SciPy's stats module provides a mode function that returns the most frequent value together with its count.
</p>
<pre class="code literal-block"><span></span><code><span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">stats</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">]</span>
<span class="n">stats</span><span class="o">.</span><span class="n">mode</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre>
<p>
We can also reproduce a frequency table. Older SciPy releases offered stats.itemfreq for this, but it has since been removed from SciPy; NumPy's unique function with return_counts=True gives the same information.
</p>
<pre class="code literal-block"><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">]</span>
<span class="n">values</span><span class="p">,</span> <span class="n">counts</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">unique</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">return_counts</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre>
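<p>
If neither SciPy nor NumPy is installed, the frequency table and the mode can also be sketched with the standard library alone. The example below is an illustration under that assumption, not part of the original snippets; statistics.multimode requires Python 3.8 or newer and returns every value tied for the highest count, which matches the "value(s)" wording used earlier.
</p>

```python
from collections import Counter
from statistics import multimode

data = [1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 5]

# Frequency table: each distinct value mapped to how often it occurs.
frequencies = Counter(data)
for value, count in sorted(frequencies.items()):
    print(value, count)

# multimode returns a list holding every value tied for the highest count.
print(multimode(data))             # [5]
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]
```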
<h3>
Conclusion
</h3>
<p>
The mode is one of the most underused descriptive values of a data set. It simply tells you which value occurs most frequently. In some instances it might be nice to know, but it is usually less useful than the median or mean. R does not provide a built-in function for finding the mode by itself. However, Python's SciPy and NumPy libraries can return either the mode itself or the frequencies of all values.
</p>
<p>
<a href="http://tylermarrs.com/quick-stats-descriptive-statistics-part-4-application-of-mean-median-and-mode/" target="_blank">In the next post we will look at some real world applications of the mean, median and mode.</a>
</p>
<p></p></div>descriptive-statisticspythonrstatisticshttp://tylermarrs.com/posts/quick-stats-descriptive-statistics-part-3-mode/Thu, 26 Jan 2017 06:12:12 GMT
- Quick Stats - Descriptive Statistics Part 2 - Medianhttp://tylermarrs.com/posts/quick-stats-descriptive-statistics-part-2-median/Tyler Marrs<div><p></p><h3>
Introduction
</h3>
<p>
In this next post on descriptive statistics, the median will be discussed. In <a href="http://tylermarrs.com/quick-stats-descriptive-statistics-part-1-mean/" target="_blank">part 1 of this blog series we learned about the mean (average)</a> and found out that it can easily be skewed by large or small values. Both the median and mean are "measures of central tendency". Central tendency is essentially the center position of a distribution for a data set. Now that you are refreshed, we can discuss the median.
</p>
<h3>
Median
</h3>
<p>
The median is simple to calculate. First, we must arrange the values from least to greatest. Next, we pick the center value. This is our median when there is an odd number of values. When there is an even number of values we simply add the two center values together and divide by 2.
</p>
<p>
Here are examples of both odd and even cases:
</p>
<p>
<strong>Odd number of values</strong>
</p>
<pre>
<code>1, 2, 3, 4, 5
Median = 3</code></pre>
<p>
<strong>Even number of values</strong>
</p>
<pre>
<code>1, 2, 3, 4
Median = (2 + 3) / 2 = 2.5</code></pre>
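<p>
The two cases above can be written out as a small Python function. This is a minimal sketch for illustration; the name median_by_hand is my own, and in practice the library functions shown later in this post do the same job.
</p>

```python
def median_by_hand(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)  # arrange from least to greatest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: the single center value is the median.
        return ordered[middle]
    # Even number of values: average the two center values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_by_hand([1, 2, 3, 4, 5]))  # 3
print(median_by_hand([1, 2, 3, 4]))     # 2.5
```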
<h3>
Median vs. Mean
</h3>
<p>
Unlike the mean, the median is barely influenced by outliers, because it depends only on the values in the middle of the sorted data. It provides the central value that most of us tend to think of as the average (mean). You might ask when you would use one over the other. This question is difficult to answer without context. In some situations it may make more sense to use one or both. Let's look at some grades for 10 random students to see how the mean and median look.
</p>
<p>
The raw values are as follows:
</p>
<pre>
<code>20, 20, 25, 27, 29, 37, 45, 55, 90, 100</code></pre>
<p style="text-align: center;">
<img alt="" class="size-full wp-image-114 aligncenter" height="299" src="http://tylermarrs.com/wp-content/uploads/2017/01/exam_grades_median_mean.png" style="" title="" width="598">
</p>
<p>
In the plot shown above, the blue line represents the mean - 44.8%. The green line represents the median - 33%. Clearly there is a difference between these two measures of central tendency.
</p>
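<p>
To see why the two measures differ, it can help to exaggerate one value. The sketch below uses the standard library's statistics module and a hypothetical data-entry error (the top grade recorded as 1000 instead of 100), which is my own illustration and not part of the grade data above.
</p>

```python
from statistics import mean, median

grades = [20, 20, 25, 27, 29, 37, 45, 55, 90, 100]
print(mean(grades), median(grades))  # 44.8 33.0

# Hypothetical data-entry error: the top grade is recorded as 1000 instead of 100.
grades_with_outlier = grades[:-1] + [1000]
print(mean(grades_with_outlier), median(grades_with_outlier))  # 134.8 33.0
```

<p>
The mean jumps from 44.8 to 134.8, while the median stays at 33 because the two center values did not change.
</p>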
<h3>
R
</h3>
<p>
To calculate the median in R, simply use the built-in median function.
</p>
<pre class="code literal-block"><span></span><code><span class="n">grades</span> <span class="o">&lt;-</span> <span class="nf">c</span><span class="p">(</span><span class="m">20</span><span class="p">,</span> <span class="m">20</span><span class="p">,</span> <span class="m">25</span><span class="p">,</span> <span class="m">27</span><span class="p">,</span> <span class="m">29</span><span class="p">,</span> <span class="m">37</span><span class="p">,</span> <span class="m">45</span><span class="p">,</span> <span class="m">55</span><span class="p">,</span> <span class="m">90</span><span class="p">,</span> <span class="m">100</span><span class="p">)</span>
<span class="n">median_grade</span> <span class="o">&lt;-</span> <span class="nf">median</span><span class="p">(</span><span class="n">grades</span><span class="p">)</span>
</code></pre>
<h3>
Python
</h3>
<p>
To calculate the median in Python we use Numpy. Import Numpy and use the median function.
</p>
<pre class="code literal-block"><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="n">grades</span> <span class="o">=</span> <span class="p">[</span><span class="mi">20</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">25</span><span class="p">,</span> <span class="mi">27</span><span class="p">,</span> <span class="mi">29</span><span class="p">,</span> <span class="mi">37</span><span class="p">,</span> <span class="mi">45</span><span class="p">,</span> <span class="mi">55</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="mi">100</span><span class="p">]</span>
<span class="n">median_grade</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">median</span><span class="p">(</span><span class="n">grades</span><span class="p">)</span>
</code></pre>
<h3>
Conclusion
</h3>
<p>
In this post we discussed the median and how it differs from the mean. We also established that context is important when deciding to use one versus the other. The mean and median are important values used in statistics and our everyday lives.
</p>
<p>
The next article will discuss the mode - another useful descriptive statistic.
</p>
<p></p></div>descriptive-statisticsmedianpythonrstatisticshttp://tylermarrs.com/posts/quick-stats-descriptive-statistics-part-2-median/Wed, 25 Jan 2017 05:28:35 GMT
- Quick Stats - Descriptive Statistics Part 1 - Meanhttp://tylermarrs.com/posts/quick-stats-descriptive-statistics-part-1-mean/Tyler Marrs<div><p></p><h3>
Introduction
</h3>
<p>
Descriptive statistics is the part of statistics in which a single value summarizes a particular data set and provides insight into it. Some examples of descriptive statistics include the mean, median and mode. In this post I will introduce the mean and show how to calculate it using both R and Python. All of the examples in Python will make use of the library <a href="https://www.scipy.org/" target="_blank">SciPy</a> or <a href="http://www.numpy.org/" target="_blank">Numpy</a>.
</p>
<h3>
Mean
</h3>
<p>
The mean provides an average of the observed values: the sum of the values divided by how many there are. It can be useful as a simple summary. However, it can become very misleading. Keep in mind that the terms mean and average can be used interchangeably.
</p>
<p>
<br>
Let's look at the following numbers:
</p>
<pre>
<code>Set 1
1, 2, 3, 4, 5, 6, 7
Set 2
1, 2, 3, 4, 5, 20, 100</code></pre>
<p>
In set 1 the average is meaningful as it represents a fair distribution of the values. However, in set 2 the average is skewed to the right by the two large numbers. In other words, the mean does not represent the majority of the observations. Imagine if these sets represented exam scores. The average for set 1 would be 4% and the average for set 2 would be 19.29% (135 / 7). You should get the idea that only 2 students in set 2 scored well above the rest while the majority of the students performed very poorly. It is very important to know the underlying data and how the mean represents it.
</p>
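<p>
The arithmetic for the two sets can be checked with a few lines of Python. This is a small sketch using the standard library's statistics module; the variable names are my own. It also counts how many observations fall below each mean, which is one simple way to see whether the mean represents the majority.
</p>

```python
from statistics import mean

set_1 = [1, 2, 3, 4, 5, 6, 7]
set_2 = [1, 2, 3, 4, 5, 20, 100]

mean_1 = mean(set_1)  # 28 / 7
mean_2 = mean(set_2)  # 135 / 7

print(mean_1)            # 4
print(round(mean_2, 2))  # 19.29

# Count how many observations lie below each mean.
below_1 = sum(value < mean_1 for value in set_1)
below_2 = sum(value < mean_2 for value in set_2)
print(below_1, below_2)  # 3 5
```

<p>
In set 1 the mean splits the data evenly (three values below, three above), while in set 2 five of the seven values sit below the mean.
</p>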
<p style="text-align: center;">
<img alt="" class="alignnone size-full wp-image-89" height="323" src="http://tylermarrs.com/wp-content/uploads/2017/01/set1.png" style="" title="" width="472"><img alt="" class="alignnone size-full wp-image-90" height="323" src="http://tylermarrs.com/wp-content/uploads/2017/01/set2.png" style="" title="" width="472">
</p>
<p>
The two plots above provide a better illustration of how skewed the data becomes with larger values. The red line marks the average. Keep in mind that this can happen with very small values as well.
</p>
<h3>
<br>
R
</h3>
<p>
In R it is very easy to calculate the mean. Below, we create a variable to hold some values and simply use the built-in mean function.
</p>
<pre class="code literal-block"><span></span><code><span class="n">my_data</span> <span class="o">=</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">5</span><span class="p">)</span>
<span class="nf">mean</span><span class="p">(</span><span class="n">my_data</span><span class="p">)</span>
</code></pre>
<h3>
Python
</h3>
<p>
Similarly, Python makes it very easy to calculate the mean. Ensure that you have the Numpy package installed to follow this snippet.
</p>
<pre class="code literal-block"><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="n">my_data</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span>
<span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">my_data</span><span class="p">)</span>
</code></pre>
<h3>
Conclusion
</h3>
<p>
The mean is a very useful but sometimes misleading descriptive statistic. It is very common to hear about averages in our daily lives, but you should know that an average may not always be a good representation of the data.
</p>
<p>
<a href="http://tylermarrs.com/quick-stats-descriptive-statistics-part-2-median/" target="_blank">The next article will cover the median - another useful descriptive statistic that is not as fragile.</a>
</p>
<p></p></div>descriptive-statisticsmeanpythonrstatisticshttp://tylermarrs.com/posts/quick-stats-descriptive-statistics-part-1-mean/Mon, 16 Jan 2017 17:46:43 GMT