Statistical Methods Guidelines

P-Value Guidelines

When presenting P-values in text, tables, or figures, P-values greater than 0.01 should be reported to 2 decimal places (e.g., P = 0.03, P = 0.02, P = 0.07) and those between 0.01 and 0.001 to 3 decimal places (e.g., P = 0.002, P = 0.007).

P-values less than 0.001 should be reported as P < 0.001.

While a significance level can be set at a value (e.g., P < 0.05), the significance of data should not be stated as P < 0.05; rather, the exact P-value should be reported. All P-values (whether significant or not) should be listed in the narrative, tables, and figures. For example, authors may have set significance at P < 0.05 in their methodology; when expressing the data for vegetable intake between two samples, write "group A mean intake was 2.0 ± 0.3 vs. group B mean intake of 0.5 ± 0.7, P = 0.02". The P-values for all predictor variables in regression models should be listed in tables.
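
These reporting rules are mechanical enough to script. A minimal Python sketch (the helper name and thresholds simply restate the rules above; this is an illustration, not an official journal tool):

```python
def format_p(p: float) -> str:
    """Format a P-value per the reporting rules above (illustrative only):
    >0.01 to 2 decimals, 0.001-0.01 to 3 decimals, <0.001 as an inequality."""
    if p < 0.001:
        return "P < 0.001"
    if p > 0.01:
        return f"P = {p:.2f}"
    return f"P = {p:.3f}"

print(format_p(0.034))   # P = 0.03
print(format_p(0.0021))  # P = 0.002
print(format_p(0.0004))  # P < 0.001
```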

 

The rationale for this decision derives from input from our statistical reviewers, who regard the P-value as a continuous measure that expresses the compatibility between the observed data and the study hypothesis. Reporting or interpreting individual results only as P < 0.05 (statistically significant or not) represents a loss of information.

Abstracts should report significant P-values as described above but may describe non-significant results as non-significant without a P-value.


cRCT Guidelines

The cRCT (cluster randomized controlled trial) study design occurs when groups (eg, schools, classrooms, clinics) are randomized but data collection occurs at the individual level. This presents some challenges for authors and reviewers. CNJ asks that the following be included in all cRCT study designs:

Note:
Authors need more than 2 clusters per group to be able to detect significant differences when clustering is accounted for. Papers with only 2 treatment clusters and 2 control clusters (or fewer) will not be accepted unless they are pilot trials with outcomes related to feasibility or effect size as the main outcomes.

Participants
Quantitative data
How did the authors decide upon the number of participants? A power analysis is the strongest rationale for determining the number of participants. If there was no power analysis, another rationale for participant recruitment should be provided.

Qualitative data
How did the authors decide upon the number of participants? Recruitment until responses were saturated is the preferred method of determining the number of participants. If this was not the approach, another rationale should be provided.

Surveys
Was reliability of the survey tested?

Internal reliability

Usually, Cronbach α is reported for multiple items [questions] relating to a similar idea or construct.

In general, we expect Cronbach α to follow the recommendations of George and Mallery (2003), who suggest the following rules of thumb for evaluating alpha coefficients: greater than 0.8 is good, 0.7 to 0.8 acceptable, 0.6 to 0.7 questionable, 0.5 to 0.6 poor, and less than 0.5 unacceptable. Values of 0.90 or greater suggest redundancy among items. Although acceptable in terms of publication, authors may want to acknowledge coefficient values below acceptable (i.e., < 0.7) as a limitation of the tool.
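
For reference, Cronbach α can be computed as (k/(k-1)) x (1 - sum of item variances / variance of total scores). A minimal pure-Python sketch, assuming complete item scores with respondents in the same order (any statistical package offers an equivalent routine):

```python
def cronbach_alpha(items):
    """Cronbach's alpha: items is a list of per-item score lists, one inner
    list per item, respondents in the same order (illustrative sketch)."""
    k = len(items)     # number of items
    n = len(items[0])  # number of respondents

    def pvar(xs):      # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(pvar(item) for item in items) / pvar(totals))

# Two hypothetical items answered identically give alpha = 1.0.
print(cronbach_alpha([[1, 3, 1, 3], [1, 3, 1, 3]]))  # 1.0
```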

Inter-rater reliability

Usually, Kappa statistics are used to determine the reliability among several raters, educators, or surveyors.
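
As a concrete illustration, Cohen's kappa (the two-rater case; Fleiss' kappa generalizes to more raters) compares observed agreement with the agreement expected by chance. A pure-Python sketch with hypothetical ratings:

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical ratings (illustrative sketch)."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of each rater's marginal proportions, summed.
    expected = sum(c1[cat] * c2[cat] for cat in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical raters scoring 6 survey responses.
print(cohen_kappa(["yes", "yes", "no", "no", "yes", "no"],
                  ["yes", "no", "no", "no", "yes", "no"]))  # about 0.67
```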

Reliability over time or stability

Usually, a test-retest is used to determine if the items, questions, or constructs are stable over time (no difference between two time points) when there is no intervention. The statistic used is often a t-test, ANOVA, or chi-square, depending on the question and response type.
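
For interval-type responses, the stability check might be sketched with a paired t-test in SciPy; the scores below are hypothetical, and a non-significant result supports stability:

```python
from scipy import stats

# Hypothetical scores from the same 8 respondents at two time points,
# with no intervention in between.
time1 = [4, 5, 3, 4, 5, 2, 4, 3]
time2 = [4, 4, 3, 5, 5, 2, 4, 3]

t, p = stats.ttest_rel(time1, time2)
print(f"t = {t:.2f}, P = {p:.3f}")  # a large P suggests stability over time
```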

If citations are used to verify the reliability of a survey or survey items, the cited testing should have been conducted in a similar population.

Cross-Sectional Data Guidelines

Data analyses for cross-sectional data should begin with tests of distribution:

p-p plots, q-q plots, skewness and kurtosis, Shapiro-Wilk, Kolmogorov-Smirnov, or Lilliefors.
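
For example, most of these checks are a single call in SciPy (Lilliefors is available in statsmodels as statsmodels.stats.diagnostic.lilliefors); the intake data below are simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
intake = rng.normal(loc=2000, scale=250, size=120)  # hypothetical kcal values

print("skewness:", stats.skew(intake))
print("kurtosis:", stats.kurtosis(intake))

w, p = stats.shapiro(intake)  # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk: W = {w:.3f}, P = {p:.3f}")
ks = stats.kstest(intake, "norm", args=(intake.mean(), intake.std()))
print(f"Kolmogorov-Smirnov: D = {ks.statistic:.3f}, P = {ks.pvalue:.3f}")
```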

 

For Chi-Square Analysis

If chi-square analysis was used, please indicate if chi-square goodness of fit, test of homogeneity, or test of independence was used.

Chi-square goodness of fit is used to test if the distribution of 1 categorical variable is the same as or different from the expected distribution. It may also be called Pearson's chi-square goodness of fit and is the most widely used chi-square test. It is appropriate for unpaired data from large samples.

Example: 
Are the selections of pineapple chunks, cookies, and ice cream as a dessert choice by fifth graders the same?
Results presented as chi-square statistic, df, and P. If P ≤ 0.05, we reject the null hypothesis that the desserts are selected equally (ie, there are significant differences in dessert selection).
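
The dessert example might look like this with scipy.stats.chisquare, which defaults to an expected distribution of equal counts (the selection counts are hypothetical):

```python
from scipy import stats

# Hypothetical dessert selections by 90 fifth graders.
observed = [42, 30, 18]  # pineapple chunks, cookies, ice cream
chi2, p = stats.chisquare(observed)  # expected: equal counts of 30 each
df = len(observed) - 1
print(f"chi-square = {chi2:.1f}, df = {df}, P = {p:.3f}")  # chi-square = 9.6, df = 2, P = 0.008
```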

Chi-square test of homogeneity is used to determine if 2 or more distributions of the same categorical variable come from the same population distribution.
Example: 
Are the distributions of responses about frequency of eating vegetables the same for adults living on the East Coast, West Coast, and Midwest?
Results presented as chi-square statistic, df, and P. If P ≤ 0.05, reject the null hypothesis that the distributions are from the same population (ie, the frequency of eating vegetables is significantly different across regions).

Chi-square test of independence is used to determine if there is an association between 2 or more variables. This test only determines whether there is a significant association, not its strength; variables should be nominal (categorical). Cramér's V may be used to assess the strength of the relationship among variables, especially if the comparison is larger than a 2 x 2 table.

Example: 
Is socioeconomic group associated with weight status?
Results presented as chi-square statistic, df, and P. If P ≤ 0.05, reject the null hypothesis that socioeconomic group and weight status are independent (ie, socioeconomic group is significantly associated with weight status).
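
The test of independence and Cramér's V (V = sqrt(chi-square / (n x (min(rows, columns) - 1)))) can be sketched with scipy.stats.chi2_contingency; the contingency table below is hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical 3 x 3 table: socioeconomic group (rows) by weight status (columns).
table = np.array([[30, 25, 10],
                  [20, 30, 15],
                  [10, 20, 40]])

chi2, p, df, expected = stats.chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi-square = {chi2:.1f}, df = {df}, P = {p:.4f}, Cramer's V = {cramers_v:.2f}")
```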

Note that significance for chi-square goodness of fit and the chi-square test of homogeneity results in concluding significant differences, while significance in the chi-square test of independence results in rejecting independence (ie, concluding that the variables ARE associated).

The Mann-Whitney U test assesses whether two independent samples have been drawn from the same population.
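
For example, with SciPy and hypothetical vegetable-intake data for two independent groups:

```python
from scipy import stats

# Hypothetical daily vegetable servings for two independent groups.
group_a = [2.0, 2.5, 3.0, 1.5, 2.8, 3.2, 2.1]
group_b = [0.5, 1.0, 0.8, 1.2, 0.4, 0.9, 1.1]

u, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u}, P = {p:.4f}")
```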

Means, SD, medians, interquartile ranges

Means and SD should be presented if data are normally distributed. Means should be reported to one decimal place more than the measurement scale (eg, calorie intake to tenths when measured in whole calories). If the significance of the data cannot be visualized while adhering to this guideline, exceptions will be made accordingly. SD should be presented; SEM or SE may be used when presenting groups within groups. In particular, SE should be used for nationally representative data, such as NHANES.

Medians and IQR should be presented if data are not normally distributed.
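
For example, with NumPy (hypothetical skewed count data):

```python
import numpy as np

# Hypothetical right-skewed data: weekly sugary-drink servings for 10 children.
servings = np.array([0, 0, 1, 1, 2, 2, 3, 4, 7, 14])

median = np.median(servings)
q1, q3 = np.percentile(servings, [25, 75])
print(f"median = {median}, IQR = {q1} to {q3}")  # median = 2.0, IQR = 1.0 to 3.75
```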


Missing data

Please be specific and justify the approach to managing missing data.

Data presentation

Were the following addressed in Methods and Results?


Decisions on data analysis

Did the authors provide a rationale for deciding to use parametric vs nonparametric analyses? Authors should:



Guidelines for Reliability and Validity Testing of Questionnaires

Face validity

Some testing of questionnaires with the target population should be completed to evaluate understanding; this is often accomplished with cognitive interviewing.

For other areas of reliability/validity testing, there should be a sample size rationale with appropriate reference(s) and/or n-to-item ratio, or a statement that addresses limitations that precluded such an a priori rationale.

Internal reliability

Usually, Cronbach’s α is reported for multiple items (questions) relating to a similar idea or construct.

In general, we expect Cronbach’s alpha to follow the recommendations of George and Mallery (2003), who suggest the following rules of thumb for evaluating alpha coefficients: greater than 0.8 as ‘good’, 0.7 to 0.8 ‘acceptable’, 0.6 to 0.7 ‘questionable’, 0.5 to 0.6 ‘poor’, and less than 0.5 ‘unacceptable’. Values of 0.9 or greater may suggest redundancy among items, which should be examined for possible removal. Although acceptable in terms of publication, authors may want to acknowledge coefficient values below acceptable (i.e., < 0.7) as a limitation of the tool.

Other tests of internal reliability include factor analysis and temporal reliability.

Factor analysis

Sample size is a concern for factor analyses and should be justified.

Exploratory factor analysis (EFA) studies the relationship between constructs and variables when there is no prior knowledge of general themes or of what latent constructs might exist. As such, the analyses are inductive. Some may call this construct validity.

In general, we expect EFA analysis to include:

Confirmatory factor analysis (CFA) is used when latent constructs have been identified and their relationships are being explored. As such, the analyses are more deductive.

In general, we expect:

Temporal reliability or stability (test/re-test reliability)

Estimation of temporal stability is necessary for tools intended to measure the same concept or construct on more than one occasion, such as in the case of pre-to-post outcome evaluation of interventions. However, test-retest should be used in situations in which variables are not likely to change within the time interval (eg, dietary intake variables can vary and thus are not conducive to test-retest).

We expect reports of temporal reliability to include: