The chi-square distribution is a mathematical distribution that is used directly or indirectly in many tests of significance. Its most common use is to test differences between proportions. Although this test is by no means the only one based on the chi-square distribution, it has come to be known as the chi-square test.
The chi-square distribution has one parameter, its degrees of freedom (df). It has a positive skew, and the skew decreases as the degrees of freedom increase. The mean of a chi-square distribution is its df. The mode is df − 2 (for df of at least 2) and the median is approximately df − 0.7.
Chi-Square Distribution:
A distribution obtained by multiplying the ratio of sample variance to population variance by the degrees of freedom, when random samples are selected from a normally distributed population.
Contingency Table:
Data arranged in table form for the chi-square independence test.
Expected Frequency:
The frequencies obtained by calculation.
Goodness of Fit Test:
A test to see if a sample comes from a population with the given distribution.
Independence Test:
A test to see if the row and column variables are independent.
Observed Frequency:
The frequencies obtained by observation. These are the sample frequencies.
Chi-Square Distribution:
The chi-square (χ²) distribution is obtained from the ratio of the sample variance to the population variance, multiplied by the degrees of freedom. This holds when the population is normally distributed with population variance σ².
Symbolically, with the usual notation (n is the sample size and s² is the sample variance), chi-square is defined as:
$$\chi^2 = \frac{(n-1)\,s^2}{\sigma^2}$$
Properties of the Chi-Square:
a. Chi-square is non-negative. It is the ratio of two non-negative values, so it must be non-negative itself.
b. Chi-square is non-symmetric.
c. There is a different chi-square distribution for each number of degrees of freedom.
d. The degrees of freedom when working with a single population variance are n − 1.
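The definition and property (d) can be sketched in a few lines of Python. The sample values, the assumed population variance, and the function name `chi_square_from_variance` are hypothetical and chosen only for illustration.

```python
import statistics

def chi_square_from_variance(sample, population_variance):
    """Return (chi-square statistic, degrees of freedom) for one sample.

    Assumes the sample was drawn from a normally distributed population.
    """
    df = len(sample) - 1                      # n - 1 degrees of freedom
    s2 = statistics.variance(sample)          # sample variance (n - 1 denominator)
    return df * s2 / population_variance, df

# Hypothetical data: five measurements and an assumed population variance of 2.0.
sample = [4.0, 7.0, 6.0, 5.0, 8.0]
stat, df = chi_square_from_variance(sample, population_variance=2.0)
print(stat, df)
```

Because both the sample variance and the population variance are non-negative, the statistic returned here can never be negative, which matches property (a).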
Chi-Square Probabilities:
Since the chi-square distribution is not symmetric, the method for looking up left-tail values is different from the method for looking up right tail values.
A. Area to the right: use the area given.
B. Area to the left: the table requires the area to the right, so subtract the given area from one and look this area up in the table.
C. Area in both tails: divide the area by two. Look up this halved area for the right critical value and one minus the halved area for the left critical value.
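The three lookup rules can be sketched as a small helper. The function name `right_tail_areas` and its `tail` labels are assumptions made for illustration; the table is assumed to be indexed by area to the right, as described above.

```python
def right_tail_areas(area, tail):
    """Convert a given tail area into the right-tail area(s) to look up.

    tail is "right", "left", or "two" (hypothetical labels).
    For "two", the result is (area for right critical value,
    area for left critical value).
    """
    if tail == "right":
        return (area,)                 # use the area as given
    if tail == "left":
        return (1 - area,)             # table wants the area to the right
    if tail == "two":
        half = area / 2                # split the area between both tails
        return (half, 1 - half)
    raise ValueError("tail must be 'right', 'left', or 'two'")

right = right_tail_areas(0.05, "right")
left = right_tail_areas(0.05, "left")
both = right_tail_areas(0.05, "two")
print(right, left, both)
```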
When the degrees of freedom aren't listed in the table, you have a couple of choices.
D. You can interpolate:
1. This is probably the more accurate way.
2. Interpolation involves estimating the critical value by figuring how far the given degrees of freedom lie between the two degrees of freedom in the table and going that same fraction of the way between the two critical values in the table. (Most people born in the 1970s didn't have to learn interpolation in high school because they had calculators which would do logarithms; we had to use tables in the “good old” days.)
E. You can go with the critical value which is less likely to cause you to reject in error (type I error):
1. For a right tail test, this is the critical value further to the right (larger).
2. For a left tail test, it is the value further to the left (smaller).
3. For a two-tail test, it is the value further to the left and the value further to the right. Note that it is not the column with the degrees of freedom further to the right; it is the critical value which is further to the right.
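Both choices can be sketched as follows. The example assumes a table that lists df = 30 and df = 40 but not df = 35; the two critical values are the standard right-tail 0.05 entries for those rows, and the function name is hypothetical.

```python
def interpolate_critical_value(df, df_low, cv_low, df_high, cv_high):
    """Linearly interpolate a critical value between two table rows."""
    fraction = (df - df_low) / (df_high - df_low)   # how far df lies between the rows
    return cv_low + fraction * (cv_high - cv_low)   # go that far between the values

# Right-tail area 0.05: table entries for df = 30 and df = 40.
cv_30, cv_40 = 43.773, 55.758

# Choice 1: interpolate for df = 35 (halfway between the two rows).
interpolated = interpolate_critical_value(35, 30, cv_30, 40, cv_40)

# Choice 2: for a right tail test, take the larger of the two critical values,
# which makes rejecting in error less likely.
conservative = max(cv_30, cv_40)

print(interpolated, conservative)
```

Linear interpolation is only an approximation, since critical values do not grow exactly linearly with the degrees of freedom.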
Uses of Chi-Square Test:
Goodness-of-Fit Test:
The idea behind the chi-square goodness-of-fit test is to see if the sample comes from the population with the claimed distribution. Another way of looking at that is to ask if the frequency distribution fits a specific pattern.
Two values are involved: the observed frequency, which is the frequency of a category in the sample, and the expected frequency, which is calculated from the claimed distribution. The idea is that if the observed frequency is really close to the claimed (expected) frequency, then the squared deviation will be small.
The square of the deviation is divided by the expected frequency to weight frequencies.
A difference of 10 may be very significant if 12 was the expected frequency, but a difference of 10 isn’t very significant at all if the expected frequency was 1200.
If the sum of these weighted squared deviations is small, the observed frequencies are close to the expected frequencies and there would be no reason to reject the claim that the sample came from that distribution. Only when the sum is large is there reason to question the distribution.
Therefore, the chi-square goodness-of-fit test is always a right tail test.
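Written out, the weighted sum described above is the usual goodness-of-fit statistic. The symbols O_i and E_i (observed and expected frequency of category i) and k (the number of categories) are introduced here for illustration. The second line works through the weighting example: the same difference of 10 contributes far more when the expected frequency is 12 than when it is 1200.

```latex
\[
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
\]
\[
\frac{10^2}{12} \approx 8.33, \qquad \frac{10^2}{1200} \approx 0.083
\]
```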
The test statistic has a chi-square distribution when the following assumptions are met:
1. The data are obtained from a random sample.
2. The expected frequency of each category must be at least 5.
The second assumption goes back to the requirement that the data be normally distributed. You're simulating a multinomial experiment (a discrete distribution) with the goodness-of-fit test (which uses a continuous distribution), and if each expected frequency is at least five then you can use the normal distribution to approximate it (much like the binomial).
The following are properties of the goodness-of-fit test:
1. The data are the observed frequencies. This means that there is only one data value for each category.
2. The degrees of freedom are one less than the number of categories, not one less than the sample size.
3. It is always a right tail test.
4. It has a chi-square distribution.
5. The value of the test statistic doesn’t change if the order of the categories is switched.
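A minimal sketch of the goodness-of-fit computation, checking the expected-frequency assumption and illustrating properties 2 and 5. The die-roll counts are hypothetical data invented for this example, as is the function name.

```python
def goodness_of_fit(observed, expected):
    """Return (chi-square statistic, degrees of freedom) for a goodness-of-fit test."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if min(expected) < 5:
        raise ValueError("each expected frequency should be at least 5")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1          # categories - 1, not sample size - 1
    return stat, df

# Hypothetical data: 60 rolls of a die claimed to be fair (10 expected per face).
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6

stat, df = goodness_of_fit(observed, expected)

# Switching the order of the categories leaves the statistic unchanged.
stat_reversed, _ = goodness_of_fit(observed[::-1], expected[::-1])

print(stat, df, stat_reversed)
```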
Test for Independence:
In the test for independence, the claim is that the row and column variables are independent of each other. This is the null hypothesis. The multiplication rule says that if two events are independent, then the probability of both occurring is the product of the probabilities of each occurring. This is key to working the test for independence.
If you end up rejecting the null hypothesis, then the assumption must have been wrong and the row and column variables are dependent. Remember, all hypothesis testing is done under the assumption that the null hypothesis is true. The test statistic used is the same as in the chi-square goodness-of-fit test. The principle behind the test for independence is the same as the principle behind the goodness-of-fit test.
The test for independence is always a right tail test. In fact, you can think of the test for independence as a goodness-of-fit test where the data are arranged into table form.
The test statistic has a chi-square distribution when the following assumptions are met:
1. The data are obtained from a random sample.
2. The expected frequency of each category must be at least 5.
The following are properties of the test for independence:
1. The data are the observed frequencies.
2. The data are arranged into a contingency table.
3. The degrees of freedom are the degrees of freedom for the row variable times the degrees of freedom for the column variable, that is, (number of rows − 1) × (number of columns − 1). It is not one less than the sample size; it is the product of the two degrees of freedom.
4. It is always a right tail test.
5. It has a chi-square distribution.
6. The expected value of each cell is computed by taking the row total times the column total and dividing by the grand total.
7. The value of the test statistic doesn't change if the order of the rows or columns is switched.
8. The value of the test statistic doesn't change if the rows and columns are interchanged (transpose of the matrix).
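The expected-value rule, the degrees of freedom, and the transpose property can be sketched as follows. The 2 × 3 table is hypothetical data made up for this illustration, and the function names are assumptions.

```python
def expected_frequencies(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    expected = expected_frequencies(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)   # (rows - 1) * (columns - 1)
    return stat, df

# Hypothetical 2 x 3 contingency table of observed frequencies.
observed = [[10, 20, 30],
            [20, 20, 20]]

stat, df = chi_square_independence(observed)

# Interchanging rows and columns (the transpose) gives the same statistic and df.
transposed = [list(col) for col in zip(*observed)]
stat_t, df_t = chi_square_independence(transposed)

print(expected_frequencies(observed))
print(stat, df, stat_t, df_t)
```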
Chi-Square Test of Goodness of Fit:
A number of marketing problems involve decision situations in which it is important for a marketing manager to know whether the pattern of observed frequencies fits well with the expected ones. The appropriate test is the χ² test of goodness of fit.
Example 1:
In consumer marketing, a common problem that any marketing manager faces is the selection of appropriate colors for package design. Assume that a marketing manager wishes to compare five different colors of package design. He is interested in knowing which of the five is the most preferred, so that it can be introduced in the market.
A random sample of 400 consumers reveals the following:
Do the consumer preferences for package colors show any significant difference?
Solution:
If you look at the data, you may be tempted to infer that Blue is the most preferred color. Statistically, you have to find out whether this preference could have arisen due to chance. The appropriate test is the χ² test of goodness of fit.
Null Hypothesis:
All colors are equally preferred.
Alternative Hypothesis:
They are not equally preferred.
Note that if the null hypothesis of equal preference for all colors is true, the expected frequency for each color is 400/5 = 80. We then apply the formula
$$\chi^2 = \sum \frac{(O-E)^2}{E}$$
where O is the observed frequency and E is the expected frequency of each color.
We get the computed value of chi-square (χ²) = 11.400. The critical value of χ² at the 5% level of significance for 4 degrees of freedom is 9.488. Since 11.400 exceeds 9.488, the null hypothesis is rejected. The inference is that not all colors are equally preferred by the consumers. In particular, Blue is the most preferred one. The marketing manager can introduce the blue package in the market.
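As a cross-check on the table lookup, the right-tail probability can be computed directly. This is a different route from the one the example takes: it uses the closed-form survival function that holds when the degrees of freedom are even, and the function name is an assumption. The figures 400, 5, 11.400 and 9.488 are those quoted in Example 1.

```python
import math

def chi_square_sf_even_df(x, df):
    """Right-tail probability P(chi-square > x), valid only for even df.

    Uses the closed form exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i            # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

expected_each = 400 / 5             # equal preference: 80 consumers per color
df = 5 - 1                          # categories - 1

p_at_critical = chi_square_sf_even_df(9.488, df)    # should be close to 0.05
p_value = chi_square_sf_even_df(11.400, df)         # roughly 0.022

print(expected_each, p_at_critical, p_value, p_value < 0.05)
```

A p-value below 0.05 leads to the same decision as comparing 11.400 with the critical value 9.488.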
Chi-Square Test of Independence:
The goodness-of-fit test discussed above is appropriate for situations that involve one categorical variable. If there are two categorical variables, and our interest is to examine whether these two variables are associated with each other, the chi-square test of independence is the correct tool to use. This test is very popular in analyzing cross-tabulations in which an investigator is keen to find out whether the two attributes of interest have any relationship with each other.
The cross-tabulation is popularly called a “contingency table”. It contains frequency data that correspond to the categorical variables in the rows and columns. The marginal totals of the rows and columns are used to calculate the expected frequencies that will be part of the computation of the statistic. For the calculation of expected frequencies, refer to HyperStat on the chi-square test.
Example 2:
A marketing firm producing detergents is interested in studying the consumer behavior in the context of purchase decision of detergents in a specific market. This company is a major player in the detergent market that is characterized by intense competition.
It would like to know in particular whether the income level of the consumers influences their choice of brand. Currently there are four brands in the market. Brand 1 and Brand 2 are the premium brands while Brand 3 and Brand 4 are the economy brands. A representative stratified random sampling procedure was adopted covering the entire market, using income as the basis of selection. The categories used in classifying income level are Lower, Middle, Upper Middle and High. A sample of 600 consumers participated in this study. The following data emerged from the study.
Cross Tabulation of Income versus Brand chosen (Figures in the cells represent number of consumers).
Analyze the cross-tabulation data above using chi-square test of independence and draw your conclusions.
Solution:
Null Hypothesis H0:
There is no association between the brand preference and income level (These two attributes are independent).
Alternative Hypothesis H1:
There is an association between brand preference and income level (These two attributes are dependent).
Let us take a level of significance of 5%. In order to calculate the χ² value, you need to work out the expected frequency in each cell of the contingency table, which is the row total times the column total divided by the grand total. In our example, there are 4 rows and 4 columns, giving 16 cells, so there will be 16 expected frequencies. For more on calculating expected frequencies, please refer to HyperStat.
Relevant data tables are given below:
Observed Frequencies (These are actual frequencies observed in the survey).
Expected Frequencies (These are calculated on the assumption of the null hypothesis being true: That is, income level and brand preference are independent).
Note:
The fractional expected frequencies are retained for the purpose of accuracy. Do not round them.
Calculation:
Compute
$$\chi^2 = \sum \frac{(O-E)^2}{E}$$
There are 16 observed frequencies (O) and 16 expected frequencies (E). As in the case of the goodness of fit, calculate this χ² value. In our case, the computed χ² = 131.76, as shown below:
Each cell in the table below shows (O − E)²/E.
There are 16 such cells. Adding all these 16 values, we get χ² = 131.76.
The critical value of χ² depends on the degrees of freedom. In any contingency table, the degrees of freedom = (the number of rows − 1) multiplied by (the number of columns − 1). In our case, there are 4 rows and 4 columns, so the degrees of freedom = (4 − 1) × (4 − 1) = 9. At the 5% level of significance, the critical χ² for 9 d.f. is 16.92. Since 131.76 is far greater than 16.92, reject the null hypothesis and accept the alternative hypothesis.
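The decision step can be sketched in a few lines. The numbers 131.76 and 16.92 are taken from the example; the function names are hypothetical.

```python
def independence_df(n_rows, n_cols):
    """Degrees of freedom for a contingency table: (rows - 1) * (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)

def reject_null(statistic, critical_value):
    """Right tail test: reject when the statistic exceeds the critical value."""
    return statistic > critical_value

df = independence_df(4, 4)             # 4 income levels by 4 brands
computed = 131.76                      # sum of the 16 (O - E)^2 / E values
critical = 16.92                       # table value for 9 d.f. at the 5% level

decision = reject_null(computed, critical)
print(df, decision)
```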
The inference is that brand preference is highly associated with income level. Thus, the choice of brand depends on the income stratum. Consumers in different income strata prefer different brands. Specifically, consumers in the upper middle and upper income groups prefer premium brands, while consumers in the lower and middle income categories prefer economy brands.
The company should develop suitable strategies to position its detergent products. In the marketplace, it should position economy brands for the lower and middle income categories and premium brands for the upper middle and upper income categories.