Statistical Methods Guidelines
P-Value Guidelines
When presenting P-values in text, tables, or figures, P-values greater than 0.01 should be reported to 2 decimal places (e.g., P = 0.03, P = 0.02, P = 0.07) and those between 0.01 and 0.001 to 3 decimal places (e.g., P = 0.002, P = 0.007).
P-values less than 0.001 should be reported as P < 0.001.
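For authors building tables programmatically, these rounding rules can be encoded in a small helper; the function below is an illustrative sketch (the name `format_p` is ours, not part of the guidelines):

```python
def format_p(p: float) -> str:
    """Format a P-value per the journal's reporting rules (illustrative)."""
    if p < 0.001:
        return "P < 0.001"          # very small values reported as an inequality
    if p < 0.01:
        return f"P = {p:.3f}"       # between 0.001 and 0.01: 3 decimal places
    return f"P = {p:.2f}"           # greater than 0.01: 2 decimal places
```
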
While a significance level can be set at a value (e.g., P < 0.05), the significance of data should not be stated as P < 0.05, but rather as the exact P-value. All P-values (whether significant or not) should be listed in narrative, tables, and figures. For example, authors may have significance set at P < 0.05 in their methodology; when expressing the data for vegetable intake between two samples, for example, write "group A mean intake was 2.0 ± 0.3 vs. group B mean intake of 0.5 ± 0.7, P = 0.02". The P-values for all predictor variables in regression should be listed in tables.
The rationale for this decision is derived from input from our statistical reviewers, who believe that the P-value is a continuous measure that expresses the compatibility between the study hypothesis and the observed data. Reporting or interpreting P < 0.05 as statistical significance with individual data represents a loss of information.
Abstracts should include significant values as described above but may reflect non-significant data as non-significant without a P-value.
cRCT Guidelines
The cRCT study design occurs when groups (eg, schools, classrooms, clinics) are randomized but data collection occurs at the individual level. This presents some challenges for authors and reviewers. CNJ asks that the following be included in all cRCT study designs:
- The number within a group, the number of groups, and the strength of group-level dependency (intraclass correlation coefficient [ICC]) should be considered.
- A power analysis using the outcome variable of interest and in consideration of the clustering will help define how many subjects are needed to see a “true” effect and the effect size.
- Include whether your model used fixed effects, random effects, or mixed effects with consideration of your cluster and a reference or explanation as to why this model was chosen.
Note:
Authors need more than 2 clusters per group to be able to detect significant differences when clustering is accounted for. Papers with only 2 treatment groups and 2 control groups (or fewer) will not be accepted unless they are pilot trials with outcomes related to feasibility or effect size as the main outcomes.
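The interplay of cluster size and ICC is often summarized with the conventional design effect, DE = 1 + (m − 1) × ICC, which shrinks the effective sample size available for a power analysis. The sketch below (function names are ours) illustrates the arithmetic:

```python
def design_effect(cluster_size: float, icc: float) -> float:
    # Conventional design effect for roughly equal cluster sizes:
    # DE = 1 + (m - 1) * ICC
    return 1 + (cluster_size - 1) * icc

def effective_n(total_n: float, cluster_size: float, icc: float) -> float:
    # Effective sample size after accounting for within-cluster dependency
    return total_n / design_effect(cluster_size, icc)

# Example: 400 students in clusters of 20 with ICC = 0.05
# behave like roughly 205 independent observations
```
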
Participants
Quantitative data
How did the authors decide upon the number of participants? Power analysis is the strongest rationale for determining the number of participants. If there was no power analysis, another rationale for participant recruitment should be provided.
Qualitative data
How did the authors decide upon the number of participants? Recruitment until responses were saturated is the preferred method of determining the number of participants. If this was not the rationale, a rationale should be provided.
Surveys
Was reliability of the survey tested?
Internal reliability
Usually, Cronbach α is reported for multiple items [questions] relating to a similar idea or construct.
In general, we expect Cronbach alpha to follow the recommendations of George and Mallery (2003), who suggest the following rules of thumb for evaluating alpha coefficients: "> 0.8 good, > 0.7 and < 0.8 acceptable, > 0.6 and < 0.7 questionable, > 0.5 and < 0.6 poor, and < 0.5 unacceptable." Values of 0.90 or greater suggest redundancies of items. Although acceptable in terms of publication, authors may want to acknowledge coefficient values less than acceptable (i.e., < 0.7) as a limitation of the tool.
- For Cronbach alpha less than 0.70, authors should try deleting items to improve Cronbach alpha; alternatives include not combining items into a composite score or not using any of the items in the results or analysis.
- If Cronbach alpha is close to 0.70 and there are fewer than 100 participants, authors should acknowledge in limitations that this measure may not be valid because of limited participants.
- If Cronbach’s alpha is in the poor or unacceptable range (< 0.6), these items should not be used in the results or discussion.
- Of note, a large sample (n ≥ 200) will produce a more reliable Cronbach alpha, although n = 100 may be reliable.
- Discretion is left to the editor if novel, pilot, or unique data are involved. In addition, authors may employ other measures of internal consistency with citations and explanations of the statistic.
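The coefficient itself is easy to compute from raw item scores; the sketch below (function name is ours) applies the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of the total score):

```python
import numpy as np

def cronbach_alpha(items) -> float:
    """Cronbach's alpha for a 2-D array: rows = respondents, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)
```
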
Inter-rater reliability
Usually, Kappa statistics are used to determine the reliability among several raters, educators, or surveyors.
Reliability over time or stability
Usually, a test-retest is used to determine if the items, questions, or constructs are stable over time (no difference between two time points) when there is no intervention. The statistic used is often a t-test, ANOVA, or chi-square, depending on the question and response type.
If citations are used to verify the reliability of a survey or survey items, they should have been tested in a similar population.
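The test-retest comparison described above can be sketched with a paired t-test on simulated data (a hedged illustration, not a prescribed workflow; variable names are ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
time1 = rng.normal(50, 10, 80)             # simulated scores, first administration
time2 = time1 + rng.normal(0, 2, 80)       # re-test with measurement noise, no drift

# Paired t-test: null hypothesis of no change between the two time points
t_stat, p = stats.ttest_rel(time1, time2)
```
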
Cross-Sectional Data Guidelines
Data analyses for cross-sectional data should begin with tests of distribution:
p-p plots, q-q plots, skewness and kurtosis, Shapiro-Wilk, Kolmogorov-Smirnov, or Lilliefors.
- If normally distributed: comparing means [t-tests] or variances [ANOVA].
- If not normally distributed, then use chi square [indicating which chi square], Mann-Whitney U, Kolmogorov-Smirnov Z.
- If log-transforming to address non-normality of variable distributions, explain how and why with references.
- If categorical with 2 categories, use binomial analyses.
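As a hedged sketch of the first step, a Shapiro-Wilk test on simulated data (variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=60)  # simulated outcome variable

w, p = stats.shapiro(sample)                     # Shapiro-Wilk test of normality
# P > 0.05: no evidence against normality -> parametric tests;
# otherwise fall back to nonparametric alternatives
family = "parametric" if p > 0.05 else "nonparametric"
```
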
For Chi-Square Analysis
If chi-square analysis was used, please indicate if chi-square goodness of fit, test of homogeneity, or test of independence was used.
Chi-square goodness of fit is used to test if the distribution of 1 categorical variable is the same as or different from the expected distribution. It may also be called Pearson’s chi-square goodness of fit and is the most widely used chi-square test. It is appropriate for use with unpaired data from large samples.
Example:
Are the selections of pineapple chunks, cookies, and ice cream as a dessert choice by fifth graders the same?
Results presented as chi-square statistic, df, and P. If P ≤ 0.05, we reject the null hypothesis that the desserts are selected equally (ie, there are significant differences in dessert selection).
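As a hedged sketch (counts are invented for illustration), the dessert example maps onto `scipy.stats.chisquare`, which defaults to a uniform expected distribution:

```python
from scipy.stats import chisquare

# Hypothetical dessert selections by fifth graders:
# pineapple chunks, cookies, ice cream
observed = [30, 45, 25]

# Null hypothesis: each dessert is selected equally often
# (chisquare defaults to uniform expected frequencies)
stat, p = chisquare(observed)
# Report the chi-square statistic, df (= 3 - 1 = 2), and P
```
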
Chi-square test of homogeneity is used to determine if 2 or more distributions of the same categorical variable come from the same population distribution.
Example:
Are the distributions of responses about frequency of eating vegetables the same for adults living on the East Coast, West Coast, and Midwest?
Results presented as chi-square statistic, df, and P. If P < 0.05, reject the null hypothesis that the distributions are from the same population (ie, the frequency of eating vegetables is significantly different across regions).
Chi-square test of independence is used to determine if there is an association among 2 or more variables. This test only determines if there is a significant association, not the strength; variables should be nominal, categorical. Cramer’s V may be used to test the strength of relationship among variables, especially if the comparison is more than a 2 x 2 table.
Example:
Is socioeconomic group associated with weight status?
Results presented as chi-square statistic, df, and P. If P ≤ 0.05, reject the null hypothesis that socioeconomic group and weight status are independent (ie, socioeconomic group is significantly associated with weight status).
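A test of independence with Cramér's V can be computed from a contingency table as follows (counts are hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3 x 2 contingency table:
# rows = socioeconomic group, columns = weight status category
table = np.array([[30, 20],
                  [25, 25],
                  [15, 35]])
chi2, p, dof, expected = chi2_contingency(table)

# Cramer's V for strength of association (useful beyond 2 x 2 tables)
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
```
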
Note that significance in chi-square goodness of fit and the chi-square test of homogeneity leads to concluding significant differences, while significance in the chi-square test of independence leads to concluding a significant association.
The Mann-Whitney U test assesses whether two samples have been drawn from the same population.
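A minimal sketch on simulated non-normal data (distributions and sample sizes are invented for illustration):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
group_a = rng.exponential(2.0, 40)   # simulated skewed (non-normal) outcomes
group_b = rng.exponential(3.0, 40)

# Null hypothesis: the two samples come from the same population
u, p = mannwhitneyu(group_a, group_b)
```
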
Means, SD, medians, interquartile ranges
Means and SD should be presented if data are normally distributed. Means should be reported to 1 decimal place more than the measurement scale (eg, calorie intake to tenths when measured in whole calories). If the data significance cannot be visualized while adhering to this guideline, exceptions will be made accordingly. SD should be presented; SEM or SE may be used when presenting groups within groups. In particular, SE should be used for nationally representative data, such as NHANES.
Medians and IQR should be presented if data are not normally distributed.
Missing data
Please be specific and justify the approach to managing missing data.
Data presentation
Were the following addressed in Methods and Results?
- Determination & treatment of outliers.
- Treatment of missing data.
- Means and SD if data have a normal distribution; IQR and median if not normally distributed.
- SEM used only if multiple samples gathered.
Decisions on data analysis
Did the authors provide a rationale for deciding to use parametric vs nonparametric analyses? Authors should:
- Tell how they decided by testing the distribution of the data for normality; how did they test to determine if data were normally distributed [p-p plots, q-q plots, skewness and kurtosis, Shapiro-Wilk, Kolmogorov-Smirnov, or Lilliefors].
- If normally distributed: Comparing means [t-tests] or variances [ANOVA].
- If not normal then Chi square, Mann-Whitney U, Kolmogorov-Smirnov Z.
- If categorical with 2 categories, binomial.
Guidelines for Reliability and Validity Testing of Questionnaires
Face validity
Some testing of questionnaires with the target population should be completed to evaluate understanding; often accomplished with cognitive interviewing.
For other areas of reliability/validity testing, there should be a sample size rationale with appropriate reference(s) and/or n-to-item ratio, or a statement that addresses limitations that precluded such an a priori rationale.
Internal reliability
Usually, Cronbach’s a is reported for multiple items [questions] relating to a similar idea or construct.
In general, we expect Cronbach’s alpha to follow the recommendations of George and Mallery (2003) who suggest the following rules of thumb for evaluating alpha coefficients: Greater than 0.8 as ‘good’, within (0.7, 0.8) as ‘acceptable’, within (0.6, 0.7) as ‘questionable’, within (0.5, 0.6) as ‘poor’, and less than 0.5 as ‘unacceptable’. Values of 0.9 or greater may suggest redundancies of items and should be examined for possible removal. Although acceptable in terms of publication, authors may want to acknowledge coefficient values less than acceptable (i.e., < 0.7) as a limitation of the tool.
- For Cronbach’s alpha less than 0.7, authors should try deleting items to improve the value of Cronbach’s alpha; alternate possible ameliorations include not combining items into a composite score and not using any of the items in the results or analysis.
- If Cronbach’s alpha is close to 0.7 and there are fewer than 100 participants, authors should acknowledge in the limitations section that this measure may not be valid and testing with a larger sample is needed to corroborate the reliability of the tool.
- If Cronbach’s alpha is in the poor or unacceptable range (< 0.6), these items should not be used in the results or discussion.
- Generally, a large sample (e.g., n ≥ 200) will produce better values for Cronbach’s alpha, although n = 100 may be sufficient to produce ‘good’ or ‘acceptable’ values.
- Discretion is left to the editor if novel, pilot, or unique data are involved. In addition, authors may employ other measures of internal reliability with citations and explanations of the statistic chosen.
Other tests of internal reliability include factor analysis and temporal reliability.
Factor analysis
Sample size is a concern for factor analyses and should be justified.
Exploratory factor analysis (EFA) studies the relationship between constructs and variables when there is no prior knowledge of general themes or of what latent constructs might exist. As such, the analyses are inductive. Some may call this construct validity.
In general, we expect EFA analysis to include:
- The model fit methodology and an assessment of the fit (e.g., comparative fit index (CFI); the Tucker–Lewis Index (TLI), also known as non-normed fit index; the root mean square error of approximation (RMSEA); or the standardized root mean square of the residuals (SRMR)). Acceptable fits are indicated by a CFI and a TLI of ≥ 0.95; an RMSEA of ≤ 0.06; or an SRMR of ≤ 0.08.
- An inter-factor correlation matrix
- The Scree plot cut-off
- A cutoff value for maintaining a factor loading
- How cross loadings were handled
- Eigenvalue criteria
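The eigenvalue and scree-plot criteria in the list above can be sketched directly from an item correlation matrix (simulated data; the Kaiser eigenvalue > 1 rule shown here is one common convention, not a journal requirement):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical item responses: 200 respondents x 6 survey items
data = rng.normal(size=(200, 6))

corr = np.corrcoef(data, rowvar=False)          # item correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]    # eigenvalues, descending

# Kaiser criterion: retain factors whose eigenvalue exceeds 1
n_factors = int((eigenvalues > 1).sum())
# A scree plot of `eigenvalues` would support the cut-off visually
```
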
Confirmatory factor analysis (CFA) is used when latent constructs have been identified and their relationships are being explored. As such, the analyses are more deductive.
In general, we expect:
- The number of factors
- Factor loadings
- Model fit estimates and assessment / diagnostics
- Estimated covariance matrix
- Parameter estimates and methods used
- An inter-factor correlation matrix
- Analysis of residual indices
- Discussion or data to determine if factors have the same meaning across groups
- Convergent and divergent validity testing may also be reported
Temporal reliability or stability (test/re-test reliability)
Estimation of temporal stability is necessary for tools intended to measure the same concept or construct on more than one occasion, such as in the case of pre-to-post outcome evaluation of interventions. However, test-retest should be used in situations in which variables are not likely to change within the time interval (eg, dietary intake variables can vary and thus are not conducive to test-retest).
We expect reports of temporal reliability to include:
- The time interval between testing and reference
- The correlation used; or tests of differences (change scores) such as a t-test or the appropriate non-parametric equivalent
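Both reporting options above (a correlation between occasions, or a test of differences) can be sketched on simulated test-retest data (values are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
test = rng.normal(3.5, 0.8, 50)           # simulated scores, first administration
retest = test + rng.normal(0, 0.3, 50)    # second administration, small noise

r, p_corr = stats.pearsonr(test, retest)          # correlation between occasions
t_stat, p_diff = stats.ttest_rel(test, retest)    # paired test of change scores
```
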