# Statistical Methods

The object of a quantitative physical experiment is to measure some quantity either directly or indirectly. No physical measurement is ever exact. The accuracy is always limited by the degree of refinement of the apparatus used and by the skill of the observer. Hence the numerical measurement of a physical quantity must include the error associated with the measurement.

Because random errors are subject to the laws of chance, their effect on the experiment may be lessened by taking a large number of measurements and using the average value as the best estimate of the true value. In what follows, we will assume that systematic errors are negligible.

### Part 1: Two different ways of describing uncertainty

There are two basic ways of indicating the error limits of a measured quantity. The first measure of error is already contained in the specification of a number when it is recorded with the appropriate number of significant figures and uncertainty. This is the method specified in your "Measurement and Uncertainty" resource.

However, measurement uncertainty only tells part of the story. There are often many random errors which contribute to the overall uncertainty of a measurement. As more and more data is collected, the researcher will begin to see a spread of measurements centered around the mean. Using statistics, the researcher can predict the probability of future data points falling near the mean.

According to the theory of probability, random errors are distributed in frequency according to the well-known Gaussian error curve.

Figure 1: Gaussian distribution

The Gaussian distribution function has the following mathematical form:

$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}} \qquad \text{(Eq. 1)}$$

The factor outside of the exponential is for correct normalization (to make the total area under the curve equal to 1). The letters μ and σ are the parameters that characterize the Gaussian; μ is equal to the mean, and σ is the standard deviation.

The mean (μ) is just the average of all the values. For a grouping of values (y), the mean is

$$\mu = \frac{1}{N}\sum_{i=1}^{N} y_i \qquad \text{(Eq. 2)}$$

In the above equation, N is the number of data points.

The standard deviation (σ) is the answer to the question "how far do we have to go away from the mean before the values start to get really improbable?" A small σ means that the distribution is narrow, and sharply peaked around its mean value; values far from the mean are very unlikely. A large standard deviation means that the distribution is broad and shallow.

Having an equation for the Gaussian is handy because we can numerically integrate it. If we do so, we get two important results:

- 68% of the total area under the curve falls between μ-σ and μ+σ, i.e. within one standard deviation of the mean.
- 95% of the total area falls between μ-2σ and μ+2σ. So it is rare for a Gaussian variable to have a value which is more than 2σ from the mean.
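The two percentages above can be checked without a spreadsheet. The sketch below is one possible check, not part of the lab procedure: it uses the closed-form result that the area within k standard deviations of the mean is erf(k/√2), which holds for any μ and σ. The function name `fraction_within` is made up for this illustration.

```python
import math

def fraction_within(k: float) -> float:
    """Area under a Gaussian between mu - k*sigma and mu + k*sigma."""
    # The result depends only on k, not on mu or sigma.
    return math.erf(k / math.sqrt(2))

print(f"within 1 sigma: {fraction_within(1):.3f}")
print(f"within 2 sigma: {fraction_within(2):.3f}")
```

The printed values should be close to 0.68 and 0.95, matching the two results listed above.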

Hence, from this curve one may construct measures of error, each of which corresponds to a certain probability that a measurement deviates from the mean by no more than that amount. For a grouping of values (y), the measure of error may be chosen to be an integer multiple of the standard deviation σ_{y}.

$$\sigma_y = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \mu\right)^2} \qquad \text{(Eq. 3)}$$

Once we've taken enough measurements to have a good idea what σ is, we know that the next measurement we take has a 68% chance to be within σ of the mean value. Therefore, when using a single measurement as a guess for the mean value, the uncertainty of that guess is just σ, the standard deviation.

However, most of the time we don't need to rely on a single value. After all, if we took repeated measurements, we already know that our best guess at the true value is not any one measurement but rather the mean of all the measurements. This best guess is better than a guess using only one value, and intuitively it should improve as we collect more data points. So how good is this estimate? The answer is given by the standard error:

$$\text{SE} = \frac{\sigma}{\sqrt{N}} \qquad \text{(Eq. 4)}$$

Therefore, if you want to cut the uncertainty in half, you need to take four times as many measurements! This is fine if it means increasing N from 1 to 4, but as N gets bigger you start to see rapidly diminishing returns. At some point, you are better off trying to reduce σ by improving your technique or your measuring device.
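To see the diminishing returns numerically, here is a minimal sketch that takes the standard error to be the standard deviation divided by the square root of N. The value of `sigma` is hypothetical and chosen only for illustration.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n measurements with standard deviation sigma."""
    return sigma / math.sqrt(n)

sigma = 0.10  # hypothetical standard deviation, in seconds
for n in (1, 4, 16, 64, 256):
    print(f"N = {n:3d}   SE = {standard_error(sigma, n):.4f} s")
```

Each row uses four times as many measurements as the previous one, yet the standard error only halves.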

The "Standard Error" is useful for knowing how much data you should take in any given experiment. The more data points you take, the smaller the standard error will be. That being said, you should go into an experiment with an idea of a reasonable and acceptable tolerance for the standard error.

### Part 2: Creating a Linear Regression

In a cause and effect relationship, the independent variable is the cause, and the dependent variable is the effect. Least squares linear regression is a method for predicting the value of a dependent variable y, based on the value of an independent variable x. Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents observations in a bivariate data set.

Suppose y is a dependent variable, and x is an independent variable. The population regression line is:

$$y = mx + b \qquad \text{(Eq. 5)}$$

where b is a constant, m is the regression coefficient, x is the value of the independent variable, and y is the value of the dependent variable.

When the regression parameters (b and m) are defined as described above, the regression line has the following properties.

- The line minimizes the sum of squared differences between observed values and predicted values (the y values computed from the regression equation).
- The regression line passes through the point whose coordinates are the mean of the x values and the mean of the y values.
- The regression constant (b) is equal to the y intercept of the regression line.
- The regression coefficient (m) is the average change in the dependent variable for a 1-unit change in the independent variable (x). It is the slope of the regression line.

The least squares regression line is the only straight line that has all of these properties.

Start with the equation y = mx + b. We assume that the independent variable is x, and we record a data point y_{i} at each value x_{i}. We then select a straight line and adjust its slope "m" and intercept "b" to get a "best fit" to the measured y_{i}. The error of the fit at each point is called the residual (ε). The residuals are the differences between the measured y values (y_{i}) and the predicted y values (mx_{i} + b) at each value of x. The residuals are the green lines in the figure below.

$$\varepsilon_i = y_i - (m x_i + b) \qquad \text{(Eq. 6)}$$

Figure 2. Residuals in green

By definition, the best-fit line is the one for which the sum of the squares of the residuals over all x_{i} is a minimum. We find it by taking the partial derivatives of the sum of the squared residuals with respect to m and with respect to b, then setting both of them to zero. When we do this, we are left with two equations that contain m and b.

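$$\frac{\partial}{\partial m}\sum_{i=1}^{N}\varepsilon_i^2 = -2\sum_{i=1}^{N} x_i\left(y_i - m x_i - b\right) = 0 \qquad \text{(Eq. 7)}$$

$$\frac{\partial}{\partial b}\sum_{i=1}^{N}\varepsilon_i^2 = -2\sum_{i=1}^{N}\left(y_i - m x_i - b\right) = 0 \;\;\Rightarrow\;\; \bar{y} = m\bar{x} + b \qquad \text{(Eq. 8)}$$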

From the second equation, it is clear that the mean x value and the mean y value are a point on the best-fit line. By solving the simultaneous equations above, we obtain the "best" values of m and b:

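$$m = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2} \qquad \text{(Eq. 9)}$$

where

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i \qquad \text{(Eq. 10)}$$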

Once you have the slope, you can plug it back into Eq. 8 and solve for the intercept.
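The procedure above can be sketched in a few lines of Python. This is an illustration rather than the exact form of the lab's equations: it assumes the common textbook expression for the slope (the sum of products of deviations from the means, divided by the sum of squared x deviations) and then uses the fact that the line passes through the point of means to get the intercept. The function name `least_squares_fit` and the sample points are made up.

```python
def least_squares_fit(xs, ys):
    """Return (m, b) for the line y = m*x + b that minimizes the squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar  # the best-fit line passes through (x_bar, y_bar)
    return m, b

# Made-up points that lie exactly on y = 2x + 1
m, b = least_squares_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(f"slope = {m:.3f}, intercept = {b:.3f}")
```

Because the sample points lie exactly on a line, the fit should recover that line's slope and intercept.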

### Part 3: Finding the standard error of the slope

Many times it is helpful to determine the standard error in the slope. First, calculate the standard deviation of the residuals (σ_{ε}).

$$\sigma_\varepsilon = \sqrt{\frac{1}{N-2}\sum_{i=1}^{N}\varepsilon_i^2} \qquad \text{(Eq. 11)}$$

Because m and b are both estimated values, we lose two "degrees of freedom." Thus, we use N-2 instead of just N. By propagating the error in the residuals, we find the error in the slope to be

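$$\sigma_m = \frac{\sigma_\varepsilon}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}} \qquad \text{(Eq. 12)}$$

and the error in the intercept to be

$$\sigma_b = \sigma_\varepsilon\sqrt{\frac{\sum_{i=1}^{N} x_i^2}{N\sum_{i=1}^{N}(x_i - \bar{x})^2}} \qquad \text{(Eq. 13)}$$

Here x̄ denotes the mean of the x values.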
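As a rough illustration of this calculation, the sketch below fits a line and then estimates the errors in the slope and intercept. It assumes the standard textbook forms: the residual standard deviation uses N-2 in the denominator, the slope error is that value divided by the square root of the summed squared x deviations, and the intercept error scales it by the square root of Σx²/(N·Σ(x - x̄)²). The function name and the data are hypothetical.

```python
import math

def slope_intercept_errors(xs, ys):
    """Return (m, b, se_m, se_b) for a least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b = y_bar - m * x_bar
    # Residuals: measured y minus predicted y at each x
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    # Two degrees of freedom are lost because m and b are both estimated
    sigma_eps = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    se_m = sigma_eps / math.sqrt(sxx)
    se_b = sigma_eps * math.sqrt(sum(x ** 2 for x in xs) / (n * sxx))
    return m, b, se_m, se_b

# Made-up data scattered around a straight line
m, b, se_m, se_b = slope_intercept_errors([0.0, 1.0, 2.0, 3.0], [1.1, 2.9, 5.2, 6.8])
print(f"m = {m:.3f} ± {se_m:.3f}")
print(f"b = {b:.3f} ± {se_b:.3f}")
```

At least three points are needed, since N-2 must be positive.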

### Activity 1: Error distribution of a simple pendulum

In this lab, the period of a simple pendulum is determined using a stopwatch. Many stopwatches will give you a precision of 1 or 10 milliseconds. However, you should intuitively know that the actual error in your time measurement is much greater due to your own reaction time and your ability to determine the "borders" of one complete oscillation of the pendulum. In order to get a better measure of the error involved, we must make multiple measurements and then analyze them using the mean and standard deviation.

In these activities, work in groups of two (not four).

There are three simple pendula of different lengths at the front of the room. Your instructor will set them in motion.

For each pendulum, measure the time of one oscillation. Do NOT time multiple oscillations then calculate the average. You really are just timing ONE oscillation here. Do this ten times for each pendulum, giving you ten values for the period of one oscillation for each pendulum. Enter these values into Microsoft Excel.

The formula for the period of a simple pendulum is as follows:

$$T = 2\pi\sqrt{\frac{L}{g}} \qquad \text{(Eq. 14)}$$

After you collect your period data, your instructor will provide you with the L (length of the pendulum) values. Please assume that the error associated with the L values is insignificant.

Because we want to create a linear regression, let's turn this into a linear formula by squaring both sides of Eq. 14. This shows that there is a linear relationship between T^{2} and L.

Because we want to estimate the error in the T^{2} values, let's go ahead and square each of our period measurements. When you do that, record these values of T^{2} in an Excel spreadsheet. Format the data so that it looks like the following (the values below are for demonstration purposes only; they do not represent real data):

Let Excel calculate the mean using the =average() function and the standard deviation (SD) using the =stdev.p() function. The ".p" means that Excel treats the data as the entire population when calculating the deviation. Also calculate the standard error (SE) by dividing the standard deviation by the square root of N.
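If you want to cross-check the spreadsheet, the same three quantities can be computed in Python. This is only a sketch: `statistics.pstdev` is the population standard deviation, which should correspond to the =stdev.p() convention, and the period values below are hypothetical, not real data.

```python
import math
import statistics

def summarize(values):
    """Return (mean, population SD, standard error) for a list of values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)        # population SD, N in the denominator
    se = sd / math.sqrt(len(values))      # SE = SD / sqrt(N)
    return mean, sd, se

# Hypothetical single-oscillation periods in seconds, for illustration only
periods = [1.78, 1.82, 1.75, 1.80, 1.85, 1.79, 1.81, 1.77, 1.83, 1.80]
t_squared = [t ** 2 for t in periods]

mean, sd, se = summarize(t_squared)
print(f"T^2 = {mean:.3f} ± {se:.3f} s^2   (SD = {sd:.3f} s^2)")
```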

Print this data for your notebook.

In your notebook, for each value of length, display your T^{2} results in this form:

T^{2} = (mean of T^{2}) ± (standard error of T^{2})

Now we can make a plot in Excel that will take this error into account.

Click the Insert tab, then click on the "Scatter" graph icon in the chart area. In the box that pops up, choose the scatter graph option that has no lines in it. A graph window will appear.

With the "Design" tab selected, click "Select Data." In the data selection window that pops up, click "Add" to add a series. For the Series X Values, select the row containing the three pendulum lengths. For the Series Y Values, select the row containing the mean values of T^{2}.

You should now have a chart that plots three points. Examine each point to make sure it was plotted correctly. To the right of the charting area is a label that says "Series 1." You may delete this label.

Now you will add a trendline. With your chart window selected, click the "Layout" tab in the menu above, then click "Trendline." Select "More Trendline Options" and then select "Display Equation on Chart." The linear equation will be shown.

Now for the error bars. With your chart window selected, click the "Layout" tab in the menu above, then click "Error Bars." In the box that pops up, choose "More Error Bar Options." Select "Custom," then "Specify Value." For both the positive and negative bars, drag your mouse to select your SE values. You should now see error bars on your data points. Although you didn't ask for them, Excel also defaults to creating horizontal error bars. Since you have no horizontal error, click on these bars and delete them.

Last of all, let Excel compute the error in the slope and intercept for you. This is done using the "Linest" function, which extends the Trendline results. The syntax of this function is: =Linest(known_Y_values, known_X_values, Constant, Statistics). If Constant is TRUE (or 1), Linest calculates the intercept. Otherwise, the intercept is set to 0. If Statistics is TRUE, Linest returns regression statistics.

Because Linest returns several values, it is called an array function. With Statistics set to TRUE, the first two rows of the array it returns are two columns wide and contain the following values:

| | |
| --- | --- |
| m | b |
| SE of m | SE of b |

- In your spreadsheet, highlight any four empty cells in two rows and two columns as above.
- In the formula entry area at the top, type the formula =LINEST(select the three T^{2} mean values, select the three L values, 1, 1).
- Press CTRL+SHIFT+ENTER.

When you do this correctly, Excel will fill in those four cells with the values you need.

Write out the equation for the linear regression. Include these SE values as the uncertainty in the slope and the y-intercept.

What is the slope of your graph? ____________

From the slope, determine g. (Hint: The slope is not g, but g can be found from it. Compare the squared form of Eq. 14 with y = mx + b to identify the constant that relates the slope to g.)

Using the error in the slope, propagate the error in g.
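One way to organize this calculation is sketched below. It assumes the standard small-angle pendulum formula T = 2π√(L/g), so that squaring gives T² = (4π²/g)·L and the slope of T² against L is 4π²/g. Because g is then inversely proportional to the slope, the relative error in g equals the relative error in the slope. The function name and the numbers passed to it are hypothetical.

```python
import math

def g_from_slope(slope: float, se_slope: float):
    """Return (g, uncertainty in g) from the slope of T^2 versus L."""
    g = 4 * math.pi ** 2 / slope
    # g is proportional to 1/slope, so sigma_g / g = se_slope / slope
    sigma_g = g * se_slope / slope
    return g, sigma_g

# Hypothetical slope and standard error in s^2/m, for illustration only
g, sigma_g = g_from_slope(4.03, 0.05)
print(f"g = {g:.2f} ± {sigma_g:.2f} m/s^2")
```

Check your own hand calculation against this approach rather than relying on it as the final answer.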

Compare your value of g to the accepted value in Boone (9.7953 m/s^{2}) using percent difference. Does the accepted value of g fall within the uncertainty of your result?

### Activity 2: Bin Histogram

In the above activity, you observed that random errors led to a variation in your experimental values for the period of a simple pendulum. Assuming that your errors were in fact random, and that your data resembled a Gaussian distribution, you used probability theory to predict the uncertainty of your measurements.

Using this same assumption, you would predict that 68% of all future data would fall within one standard deviation of the mean, and that 95% of all future data would fall within two standard deviations of the mean (see Figure 1).

However, error is not always random, and data is not always distributed in a way that resembles a Gaussian error curve.

For a Gaussian distribution, the probability that a data point will be some exact value is actually very, very small -- almost zero. However, you can reliably predict the probability that a data point will fall within a certain range. For example, we predict that 68% of the data points that follow a Gaussian distribution will fall within one standard deviation of the mean value.

First, you will need to sort the column of data associated with the 0.8 m pendulum from highest to lowest. However, you can't sort a column that contains formulas. To work around this, copy the column and paste it nearby as "values." This is done by right-clicking in the cell you want to paste data to, then under "Paste Options" choosing "values."

Now that the 0.8 m pendulum data is sorted, what percentage of the data points fall within one SD on either side of the mean? What percentage of the data points fall within two SDs on either side of the mean? Show your calculations.
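The counting can also be done programmatically as a check on your hand calculation. The sketch below is an illustration with made-up numbers: it uses the population standard deviation and counts a point as "within" if it lies at or inside the boundary. The function name `fraction_within_sd` is hypothetical.

```python
import statistics

def fraction_within_sd(values, k):
    """Fraction of values lying within k standard deviations of their mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = [v for v in values if abs(v - mean) <= k * sd]
    return len(inside) / len(values)

# Made-up values, for illustration only
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(f"within 1 SD: {100 * fraction_within_sd(data, 1):.0f}%")
print(f"within 2 SD: {100 * fraction_within_sd(data, 2):.0f}%")
```

With only ten measurements per pendulum, each point shifts the percentage by ten points, so do not expect to land exactly on 68% and 95%.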

If the answers were close to 68% and 95%, then chances are good that your data follows a Gaussian distribution. One way to visually test for a Gaussian distribution is to create a bin histogram. A "bin" is simply a range of values that any data point may fall into. If you were to place data points into evenly sized, adjacent bins, you would expect most of your data to go into the bins closest to the mean value of the data.

Let's do this for the data associated with the 0.8 meter pendulum, since you have already sorted this data.

Make the bin size equal to a third of the standard deviation. This way, you can assume that 95% of the data should fit into twelve bins.

Make a column in your spreadsheet titled **bins**. The first bin value should be the lowest data value + the bin size. The next bin value should be the previous bin value + the bin size. Keep adding bins into this column until you have enough bins to hold all the data. As it happens, this data range spans ten bins. Your data may require more or fewer bins.
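The bin construction described above can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the spreadsheet tool: each bin value is treated as the upper bound of its bin, a data point is counted in the first bin whose upper bound is at least as large as the point, and the data are assumed not to be all identical (otherwise the bin size would be zero). The function name and the sample values are made up.

```python
import bisect
import math
import statistics

def bin_counts(values, bins_per_sd=3):
    """Return (upper bin edges, counts) using bins of width SD / bins_per_sd."""
    width = statistics.pstdev(values) / bins_per_sd
    low = min(values)
    # Enough bins to reach from the lowest value up to the highest value
    n_bins = max(1, math.ceil((max(values) - low) / width))
    edges = [low + width * (i + 1) for i in range(n_bins)]
    counts = [0] * n_bins
    for v in values:
        # First bin whose upper edge is >= v; clamp to guard against rounding
        i = min(bisect.bisect_left(edges, v), n_bins - 1)
        counts[i] += 1
    return edges, counts

# Made-up values, for illustration only
edges, counts = bin_counts([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
for edge, count in zip(edges, counts):
    print(f"<= {edge:5.2f}: {'#' * count}")
```

The text plot gives a quick preview of the histogram shape before you build the chart.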

Under the "Data" tab, click "Data Analysis" in the group at the upper right. If you do not see it, follow the instructions found here to learn how to install the Analysis ToolPak needed for creating histograms. If you already have "Data Analysis," you don't need to follow this link.

In the option box that comes up, click Histogram. Another window will open. By clicking the icon to the right of the range input box, you can select a data grouping from your spreadsheet as the range. For the input range, drag the mouse to select all of your data under the 0.8 m column. Press the Enter key to continue. For the bin range, select the values in your bin column. Select the Chart Output option, then click OK. A new worksheet will open that lists the frequency of data points in each bin and also plots the histogram.

Does your data appear to have a Gaussian distribution?

### Activity 3: Analysis of a large data set

One factor to consider when creating a data set is the number of data points that are collected. In order to have a distribution, multiple measurements must be made. The more measurements, the more well-defined that distribution will be. As we saw above, the strength of an estimate is related to the standard error (see Eq. 4).

Here is a link to a large data set created using those same pendula:

For this large data set, create a histogram as described in Activity 2 above, using only the 0.8 m length. How does this histogram compare to the one you created using your own data?

Print this histogram for your notebook.