FALL QUARTER, 2008
STT 740 CATEGORICAL DATA ANALYSIS
Text: Categorical Data Analysis, second edition, by Alan Agresti (ISBN 0-471-36093-7)
Meeting time & place: 4:10 pm - 6:00 pm, M W, 309 Oelman Hall
Prerequisites: STT 662 and STT 666, or permission of instructor
Course Description: Standard techniques for analyzing and describing two-dimensional contingency tables.
Logistic regression models and loglinear models developed for data structures involving
categorical response variables, including model selection procedures, diagnostics,
association graphs, and collapsibility. SAS procedures used for analysis of data sets.
WINTER QUARTER, 2009
STT 520/CMH 620 BIOSTATISTICS FOR HEALTH PROFESSIONALS
Text: Biostatistics, A Foundation for Analysis in the Health Sciences, eighth edition, by Wayne W. Daniel
Brief SOLUTIONS to selected exercises for which solutions are not given in the back of the book are given below. ERRATA in the text are given at the end of this page.
Chapter 1
1, p. 12
Descriptive statistics: that part of the science of statistics concerned with the collection, organization, and summarization of data.
2, p. 12
Inferential statistics: the procedure whereby we reach a conclusion (or draw an inference) about a population based on information contained in a sample from (or subset of) the population.
6, p. 13
(a) quantitative: ordinal
(b) qualitative: nominal
(c) quantitative: ratio
(d) qualitative: nominal
(e) quantitative: ratio
(f) quantitative: interval
7, p. 13
(a) A: the 300 households, B: the 250 patients
(b) A: perhaps all households in the small southern town, B: all patients admitted to the hospital during the past year
(c) A: whether there was at least one school-age child present in a household, B: distance that a patient lives from the hospital
(d) A: 300, B: 250
(e) A: nominal (with responses "Yes" or "No"), B: ratio
Chapter 2
3, p. 50.
Advantage: it's simple.
Disadvantage: uses only two data values, so it's not as sensitive to the nature of the data as other measures of dispersion.
5, p. 50
The coefficient of variation, denoted by C.V., is the ratio of the sample standard deviation to the sample mean, multiplied by 100 to express it as a percentage. In other words, the C.V. is the standard deviation expressed as a percentage of the mean. As such, it provides a measure of variability that is unitless (recall that for data measured in inches, the variance is in squared inches and the standard deviation is in inches). This makes it useful for comparing the variability of two different variables that are measured in different units, e.g., a sample of students' heights measured in inches compared to their weights measured in pounds.
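The unitless property can be checked numerically. The following is a minimal sketch, not part of the text's solution; the five heights are hypothetical values chosen only for illustration, and the function name is an assumption.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the sample mean."""
    return 100 * stdev(values) / mean(values)

# Hypothetical heights of five students, in inches (illustrative only).
heights_in = [62.0, 65.0, 68.0, 70.0, 73.0]
# The same heights converted to centimeters.
heights_cm = [2.54 * h for h in heights_in]

cv_in = coefficient_of_variation(heights_in)
cv_cm = coefficient_of_variation(heights_cm)
print(f"C.V. in inches: {cv_in:.2f}%   C.V. in centimeters: {cv_cm:.2f}%")
```

Changing the unit rescales the standard deviation and the mean by the same factor, so the two printed percentages agree.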
7, p. 50
median
21, p. 54
If you choose the median, then you only have to do well on three of the five tests (you can get zeros on the other two tests without penalty).
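A short sketch of this point, using hypothetical test scores (they are not from the exercise):

```python
from statistics import mean, median

# Hypothetical scores: zeros on two tests, strong scores on the other three.
scores = [0, 0, 90, 95, 100]

# The median is the third-highest of five scores, so the two zeros do not lower it.
print("median:", median(scores))
# The mean uses every score, so the two zeros pull it down.
print("mean:", mean(scores))
```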
26, p. 55.
(a) median (income data tend to be skewed positively)
(b) mode (these are nominal data)
(c) mean or median (weights typically have a symmetric distribution)
Chapter 4
2, p. 124
A continuous random variable assumes values that form a continuum (no "gaps") on the real number line. Typical continuous random variables are: age, blood pressure, cholesterol level, BMI, etc.
4, p. 124
The probability distribution of a continuous random variable is a representation of the relative frequency of each value assumed by the random variable. This can be done in tabular form by using a frequency table, or graphically by using a histogram.
26, p. 126.
By standardizing, rewrite the given probability as: P[-k < Z < k] = .754
Draw the picture, then use the standard normal table (Table D) to find k = 1.16.
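The table lookup can be checked with software. This is a sketch that uses Python's standard-library NormalDist in place of Table D; the variable names are illustrative.

```python
from statistics import NormalDist

central_area = 0.754  # P[-k < Z < k], as given in the exercise

# By symmetry, the area to the left of k is 0.5 plus half of the central area.
k = NormalDist().inv_cdf(0.5 + central_area / 2)
print(round(k, 2))
```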
Chapter 6
6, p. 201
point estimate +/- (reliability coefficient)(SE of the point estimate)
24, p. 203.
(358 - 483) +/- (2.58)SQRT[94864/52 + 20736/53] => [-246.4, -3.6]
The only assumption that is needed is that the samples are random and independent.
Population: women with premature rupture of membranes at term.
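The arithmetic for the interval can be reproduced as follows. This is a sketch only; the names mean_1, var_1, n_1 and so on are labels for the quantities appearing in the formula above and are not notation from the text.

```python
from math import sqrt

# Quantities as they appear in the formula above (group labels are illustrative).
mean_1, var_1, n_1 = 358, 94864, 52
mean_2, var_2, n_2 = 483, 20736, 53
z = 2.58  # reliability coefficient used in the solution

point_estimate = mean_1 - mean_2
standard_error = sqrt(var_1 / n_1 + var_2 / n_2)

# point estimate +/- (reliability coefficient)(SE of the point estimate)
lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error
print(f"[{lower:.1f}, {upper:.1f}]")
```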
Chapter 7
1, p. 278
The purpose of hypothesis testing is to aid the researcher in reaching a conclusion about a population parameter by examining a sample from the population.
6, p. 279
Null hypothesis: status quo, no change, no difference, no effect
Alternative hypothesis: research hypothesis, positive or negative (nonzero) change or difference or effect
9, p. 279
It is assumed that the population variances of the two groups are equal in the small sample size case. So, the sample variances will be comparable. Then the best estimate (smallest variance) for the common population variance is the pooled sample variance.
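As an illustration, the pooled sample variance weights each sample variance by its degrees of freedom. The sketch below uses two small hypothetical samples, and the function name is an assumption.

```python
from statistics import variance

def pooled_variance(sample_1, sample_2):
    """Average of the two sample variances, weighted by degrees of freedom."""
    n1, n2 = len(sample_1), len(sample_2)
    weighted_sum = (n1 - 1) * variance(sample_1) + (n2 - 1) * variance(sample_2)
    return weighted_sum / (n1 + n2 - 2)

# Hypothetical samples (illustrative only).
group_a = [1, 2, 3]
group_b = [2, 4, 6, 8]
pooled = pooled_variance(group_a, group_b)
print(pooled)
```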
10, p. 279
The paired comparisons test allows each subject to serve as his/her own control, thereby reducing the variation produced by extraneous factors.
20, p. 280.
Two-tailed alternative hypothesis (alternative hypothesis: means for the two groups differ).
TS: z = 1.14
RR: z < -1.96 or z > 1.96
Do not reject the null hypothesis at alpha = 0.05. P = 2 x 0.1271 = 0.2542.
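The decision and the P-value can be checked with software in place of the normal table. This sketch uses Python's standard-library NormalDist; software gives about 0.2543, which differs from the table-based value only by rounding.

```python
from statistics import NormalDist

z = 1.14         # test statistic from the solution
critical = 1.96  # two-tailed critical value at alpha = 0.05

# Two-tailed P-value: twice the area beyond |z| under the standard normal curve.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject = abs(z) > critical
print(f"P = {p_value:.4f}, reject null hypothesis: {reject}")
```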
Chapter 8
1, p. 368
The analysis of variance, abbreviated ANOVA, is a statistical method whereby the variability of the data is partitioned into components corresponding to the sources (or causes) of the variability. The variability for each source adds up to the total variability. By comparing the variability for each source, hypotheses concerning equality of means can be formally tested.
2, p. 368
Subjects in the sample are randomly assigned to treatment levels.
4, p. 368
Each subject in the sample is measured repeatedly over time; the analysis is identical to that for the randomized complete block design (RCBD), also called the two-factor ANOVA without replication.
6, p. 368
Tukey's HSD test is a post-hoc analysis of a significant ANOVA and is used to determine how the population means differ.
8, p. 368
Include a blocking factor in the design so that in the analysis the extraneous variation produced by that blocking factor is accounted for; also called a two-way ANOVA without replication.
9, p. 368
Interaction occurs in a two-way ANOVA when the comparison of means for the levels of one factor is not the same for every level of the other factor. For example, the mean cholesterol level may be higher in men than women for African-Americans, but lower in men than women for Caucasians.
11, p. 368
An ANOVA table is a tabular summary of the calculations performed for an analysis of variance of a set of data.
28, p. 379
(a) Two-way factorial ANOVA
(b) There is a statistically significant interaction effect between factors A and B. The mean outcome is not the same for all levels of factor A, but how they differ depends on which level of factor B is considered. Similarly, the mean outcome is not the same for all levels of factor B, but how they differ depends on which level of factor A is considered.
38, p. 382.
(a) Two-way factorial ANOVA.
(b) Two response variables: cleavage rate and birth rate of pups.
Two treatment variables: donor cell type and genotype.
(c) Factors: donor cell type (2 levels) and genotype (6 levels)
Subjects: mice
(d) Gender? Age of mouse? Size of mouse?
(e) None
(f) Donor cell type, df = 1. Genotype, df = 5.
Chapter 9
1, p. 459
(i) independent observations, (ii) outcome (Y) is normal for each X-value, (iii) variance of Y is the same for each X-value (homoscedasticity), (iv) X and Y are linearly related.
4, p. 459
The estimated slope, b, measures the change in Y for a one-unit increase in X.
6, p. 459
The coefficient of determination measures the proportion of variation in Y that is explained by (or attributable to) X; it is equal to SSR/SST.
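A minimal sketch of the calculation follows, using a least-squares fit to hypothetical data. The x and y values are invented for illustration, and SSE (the residual sum of squares) is included only to show that SSR and SSE add up to SST.

```python
from statistics import mean

# Hypothetical data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.2, 4.8, 6.0]

x_bar, y_bar = mean(x), mean(y)

# Least-squares estimates of the slope (b) and y-intercept (a).
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # explained by X
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained
sst = sum((yi - y_bar) ** 2 for yi in y)                # total

r_squared = ssr / sst
print(f"slope = {b:.3f}, R-squared = {r_squared:.3f}")
```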
9, p. 459
(i) estimation of slope and y-intercept
(ii) prediction of Y for a given X
13, p. 459
The sample regression equation only describes the relationship between X and Y for the range of X-values represented in the sample; the relationship between X and Y outside that range is unknown. Suppose that we collect data on Y = "height of plant" at the end of a two-week growing period and X = "ounces of water per day given to the plant," and the resulting estimated regression equation reveals that the plant is 0.62 inches taller for each additional ounce of water given to the plant for X = 1, 2, 3, 4, and 5. Can we then extrapolate and conclude that a plant having "Jack and the Beanstalk" proportions will result by feeding the plant 10 tons of water per day?
27, p. 466
(a) both
(b) age
(c) there are two: vitamin C density and calcium density
(d) null: beta = 0, alternative: beta not equal to 0
(e) the null hypothesis is rejected at the 0.05 level of significance when vitamin C density is regressed on age, but it is not rejected at the 0.05 level of significance when calcium density is regressed on age.
(f) estimation appears to be more relevant (perhaps prediction is also relevant)
(g) the sampled population is the set of all infants and toddlers ages 15-24 months whose primary caregivers were willing/able to respond to the telephone survey.
(h) the target population would be the set of all infants and toddlers ages 15-24 months.
(i) inversely related
Chapter 10
1, p. 522
(i) independent observations, (ii) outcome (Y) is normal for each combination of X-values, (iii) variance of Y is the same for each combination of X-values (homoscedasticity), (iv) Y is related to each X linearly.
3, p. 522
(a) The coefficient of multiple determination, denoted by R-squared, is the proportion of variability in the outcome (Y) that is attributed to the regression on the predictors (X1, X2, ..., Xk).
(b) The multiple correlation coefficient is the square root of the coefficient of multiple determination. It's another way of expressing the strength of the linear relationship between the outcome and the predictors.
12, p. 527
(a) Regression
(b) Number of visits to emergency rooms
(c) Barometric pressure, air temperature, humidity, wind speed, etc.
(d) Null hypothesis: all regression coefficients are zero
Alternative hypothesis: not all regression coefficients are zero
(e) Coefficients for barometric pressure, air temperature, humidity, and wind speed were evidently significant.
(f) Both are relevant
(g) Tokyo children having asthma attacks
(h) All Japanese children having asthma attacks
(i) Number of visits is negatively correlated with barometric pressure, air temperature, humidity, and wind speed.
(j) E(Y) = b0 + b1*barometric + b2*temperature + b3*humidity + b4*speed, where b1, b2, b3, and b4 are negative numbers.
(k) 0.22
(l) Not known from the information given
16, p. 528
(a) Regression
(b) Femoral neck bone mineral density
(c) Weight, age, total lymphocyte count
(d) Same as for problem 12 above
(e) All three coefficients are significant
(f) Probably estimation
(g) Caucasian, healthy postmenopausal women
(h) Caucasian, healthy postmenopausal women
(i) Bone mineral density is evidently related to weight, age, and total lymphocyte count; it is not known from the information given whether the relationships are direct or inverse.
(j) E(Y) = b0 + b1*weight + b2*age + b3*lymphocyte; the values of b0, b1, b2, and b3 are not known from the information given.
(k) 0.4
(l) Not known from the information given
Chapter 12
6, p. 662
A contingency table is the representation, in the form of a table having, say, r rows and c columns, of the cross-classification of subjects according to the levels of each of two discrete factors (say, a row factor and a column factor). The frequencies are recorded in the contingency table for each combination of levels of the two factors.
7, p. 662
df = (r-1)(c-1) = (# rows - 1)(# columns - 1)
21, p. 665
(i) Null hypothesis: ethnic group and hemoglobin level ARE NOT related
Alternative hypothesis: ethnic group and hemoglobin level ARE related
(ii) TS: X-squared = 67.802 (after much calculation)
(iii) RR: those values of X-squared that exceed 9.488
(iv) Reject the null hypothesis at the 0.05 level of significance
(v) There is strong evidence in these data to indicate that ethnic group and hemoglobin level are related. (In fact, by analyzing proportions, it appears that the proportion of subjects with hemoglobin level 10.0 or greater is highest in ethnic group C and lowest in ethnic group B.)
P-value: since 67.802 exceeds 14.860 (the 99.5th percentile of the 4-df chi-squared distribution -- see Table F, page A-41), then P < 0.005.
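The table values quoted above can be checked without Table F. This sketch relies on the closed form of the upper-tail area of the chi-squared distribution when df = 4, namely exp(-x/2)(1 + x/2); the function name is an assumption, and the formula applies only to 4 degrees of freedom.

```python
from math import exp

def chi2_upper_tail_df4(x):
    """Upper-tail area P[X-squared > x] for the chi-squared distribution with 4 df."""
    return exp(-x / 2) * (1 + x / 2)

# Table values quoted in the solution.
print(f"area beyond 9.488:  {chi2_upper_tail_df4(9.488):.4f}")
print(f"area beyond 14.860: {chi2_upper_tail_df4(14.860):.4f}")

# P-value for the observed test statistic.
p_value = chi2_upper_tail_df4(67.802)
print(f"P-value: {p_value:.2e}")
```

The computed P-value is far below 0.005, in agreement with the table-based bound.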
ERRATA in the eighth edition of Wayne W. Daniel's text:
p. vi, 7th line: change "then" to "than"
p. vii, 3rd paragraph: replace with: "Dr. Harry Khamis, Director and Mrs. Beverly Grunden, consultant at the Wright State University Statistical Consulting Center were kind enough to facilitate ..."
p. 23, 4th line from bottom of text: replace "84:5" with "84.5"
p. 36, 4th line after end of Example 2.4.1: change "represents" to "represent"
p. 38, 9th line: change "observation" to "observations"
p. 46, Example 2.5.4 solution: there are "+" signs where there should be "=" signs
p. 387, bottom: change "placebe" to "placebo"
p. 594, 1st line: change "same" to "sample"
p. 624, 1st line: change "nA•" to "nA•/n"
p. 668, middle of page: change "For each of the Exercises 35 through 54, ..." to "For each of the Exercises 35 through 53, ..."
p. A-110, 2nd line from bottom: change "96.55%" to "96.43%"
Last update: August 29, 2008