READy Exercise

Author

Nalany Richardson and Riley Herber

Figure 1: The plight of Rscientists in training

Author affiliations

  1. University of Georgia, Athens, GA, USA.

1 Abstract

This exercise demonstrates a short, reproducible workflow for cleaning a small, simple dataset and running basic statistical analyses. We were given a ‘dirty’ dataset with 3 variables (Height, Weight, Gender) and added one new numerical variable (Sleep) and one new categorical variable (Diet).

We cleaned obvious data issues (e.g., numerical data stored as text, impossible data points), saved the processed dataset as an .rds file, generated two new plots, and fit linear regression models.

Our data analysis showed no obvious relationship between weight and sleep, and height distributions overlapped across various diet categories. In the sleep and diet model, neither variable showed evidence of association with height.

This workflow was simply constructed to demonstrate a short and reproducible pipeline for analysis, and our results are limited by both sample size and the imaginary nature of our included variables.

2 Introduction

2.1 General Background Information

In total, there are 5 variables in our dataset.

Numerical variables include:

  • Height in centimeters

  • Weight in kilograms

  • Sleep, as reported hours of sleep

Categorical variables include:

  • Diet, reported as vegan, vegetarian, mediterranean, or omnivore

  • Gender, as M(ale), F(emale), O(ther)

2.2 Description of data and data source

Two new variables, average sleep duration (Sleep) and dietary plan (Diet), were added to the data file ‘exampledata2’, a copy of the original ‘exampledata’ dataset that contained the Height, Weight, and Gender samples. We wanted to add lifestyle factors of the kind we have seen used in studies alongside other quantified body metrics. These new variables are synthetic and are not intended to reflect any real person; they serve only as test variables for our analysis.

2.3 Questions/Hypotheses to be addressed

We sought to address three questions. First, is there an association among weight, height, and gender? Second, are there any observable relationships between sleep and weight, or between diet and height? Finally, is height associated with sleep and diet when both predictors are included in a linear regression model?

3 Methods

The data are described in the Introduction and can be found in ~/data/raw-data/exampledata2.xlsx (raw) or ~/data/processed-data/processeddata2.rds (processed). Our workflow is outlined below.

3.1 Schematic of workflow

  1. Add variables: Add Sleep (numeric) and Diet (categorical) to the copy of the raw dataset (exampledata2.xlsx).
  2. Clean data: Fix non-numeric height entries (e.g., “sixty” to 60), fix improper units (e.g., convert a height of 6 ft to cm), and remove the extreme weight outlier (7000). Save the cleaned data as processeddata2.rds, our copy of ‘processeddata.rds’.
  3. Exploratory analysis: Create and save two figures: a boxplot of Height by Diet and a scatterplot of Weight vs Sleep.
  4. Statistical analysis: Fit linear models and save the result tables as files.
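The cleaning step above can be sketched in base R as follows. This is a minimal illustration, not the project's actual cleaning script: the rows in `raw`, the two cutoffs (`< 10` for heights assumed to be in feet, `< 1000` for plausible weights), and the temporary output path are all assumptions made for the example.

```r
# Hypothetical raw rows that mimic the issues described in the workflow;
# these are illustrative values, not the contents of exampledata2.xlsx.
raw <- data.frame(
  Height = c("180", "sixty", "6", "156", "166"),
  Weight = c(80, 60, 55, 7000, 110),
  Gender = c("M", "F", "O", "F", "M"),
  stringsAsFactors = FALSE
)

clean <- raw

# 1. Replace the spelled-out height with digits, then convert the column to numeric.
clean$Height[clean$Height == "sixty"] <- "60"
clean$Height <- as.numeric(clean$Height)

# 2. Treat very small heights as feet and convert them to cm (1 ft = 30.48 cm).
#    The cutoff of 10 is an assumption for this sketch.
in_feet <- clean$Height < 10
clean$Height[in_feet] <- clean$Height[in_feet] * 30.48

# 3. Drop the impossible weight value. The cutoff of 1000 kg is an assumption.
clean <- clean[clean$Weight < 1000, ]

# 4. Store Gender as a factor so models treat it as categorical.
clean$Gender <- factor(clean$Gender)

# 5. Save the processed data. A temporary folder stands in for the
#    processed-data folder used in the project.
out_file <- file.path(tempdir(), "processeddata2.rds")
saveRDS(clean, out_file)
```

Reading the file back with `readRDS(out_file)` returns the cleaned data frame with column types preserved, which is the main reason to prefer .rds over .csv for processed data.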

3.2 Data acquisition

Data were retrieved from the Andreas Handel Research Group at UGA GitHub repository (ahgroup/data-analysis-template), in the ~/data/raw-data folder. The new variables were added by us and are synthetic.

For data import and cleaning, the raw file was read into R and cleaned as outlined in step 2 of the workflow: non-numeric height entries were fixed, heights recorded in improper units were converted to centimeters, and the extreme weight outlier (7000) was removed. The cleaned data were saved as processeddata2.rds.

3.3 Statistical analysis

We fit three linear regression models with Height as the outcome variable. Models were fit with the lm() function in R (run in RStudio version 2025.05.1+513), and model coefficients were saved with broom::tidy() as RDS tables for reproducible reporting. The models were:

  1. Model 1: Height ~ Weight
  2. Model 2: Height ~ Weight + Gender
  3. Model 3: Height ~ Sleep + Diet

Gender and Diet were treated as categorical predictors, and Weight and Sleep were treated as numeric predictors. Model output tables were saved to the results/tables/ folder.
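A minimal sketch of this modeling step is shown below. The data frame `d` holds made-up values, not our dataset. The helper `coef_table()` is a hypothetical base-R stand-in that builds the same columns broom::tidy() returns for an lm fit, so the example runs without extra packages. The temporary folder stands in for results/tables/, and the file names are placeholders.

```r
# Made-up data for illustration only (9 rows, same variable names as the report).
d <- data.frame(
  Height = c(180, 175, 178, 156, 166, 155, 165, 133, 170),
  Weight = c(80, 70, 76, 90, 110, 54, 65, 45, 55),
  Gender = factor(c("M", "F", "M", "F", "O", "M", "F", "O", "M")),
  Sleep  = c(7, 6, 8, 5, 6, 9, 6, 8, 6),
  Diet   = factor(c("vegan", "omnivore", "vegetarian", "mediterranean", "vegan",
                    "omnivore", "vegetarian", "mediterranean", "vegan"))
)

# The three model formulas listed above.
formulas <- list(
  model1 = Height ~ Weight,
  model2 = Height ~ Weight + Gender,
  model3 = Height ~ Sleep + Diet
)

# Hypothetical helper: coefficient table with the column names that
# broom::tidy() uses (term, estimate, std.error, statistic, p.value).
coef_table <- function(fit) {
  est <- summary(fit)$coefficients
  data.frame(
    term      = rownames(est),
    estimate  = est[, "Estimate"],
    std.error = est[, "Std. Error"],
    statistic = est[, "t value"],
    p.value   = est[, "Pr(>|t|)"],
    row.names = NULL
  )
}

# Fit each model and extract its coefficient table.
fits        <- lapply(formulas, lm, data = d)
coef_tables <- lapply(fits, coef_table)

# Save one RDS file per model (temporary folder used as a placeholder).
table_dir   <- tempdir()
table_files <- file.path(table_dir, paste0(names(coef_tables), "-table.rds"))
for (i in seq_along(coef_tables)) {
  saveRDS(coef_tables[[i]], table_files[i])
}
```

Because Gender and Diet are factors, lm() expands each into indicator columns against a reference level (the first level alphabetically by default), which is why the tables report terms such as GenderM or Dietvegan rather than a single Gender or Diet row.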

4 Results

4.1 Exploratory/Descriptive analysis

Table 1 shows a summary of the data.

Table 1: Data summary table.

Character variables
skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Diet           0          1              5    13   0      4         0

Factor variables
skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Gender         0          1              FALSE    3         M: 4, F: 3, O: 2

Numeric variables
skim_variable  n_missing  complete_rate  mean        sd         p0   p25  p50  p75  p100  hist
Height         0          1              165.666667  15.976545  133  156  166  178  183   ▂▁▃▃▇
Weight         0          1              70.111111   21.245261  45   55   70   80   110   ▇▂▃▂▂
Sleep          0          1              6.777778    1.481366   5    6    6    8    9     ▅▇▁▇▂

Figure 2: Histogram of Weight Distribution.
Figure 3: Histogram of Height Distribution.

4.2 Full Analysis


Figure 4: Height and weight stratified by gender.
Figure 5: Weight vs Height with fitted linear regression.

Table 2 shows the association between Height and Weight. Weight was not significantly associated with Height (estimate = 0.23 cm per kg, p = 0.43).

Table 2: Model 1: Height ~ Weight.
term estimate std.error statistic p.value
(Intercept) 149.6997661 19.7518528 7.5790240 0.0001285
Weight 0.2277371 0.2708841 0.8407177 0.4282860
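As a quick check on how the columns of Table 2 relate to one another, the statistic is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with n minus (number of coefficients) degrees of freedom. The sketch below assumes n = 9, the number of rows implied by the Gender counts in Table 1 (4 + 3 + 2); the result can be compared against the Weight row of Table 2.

```r
# Values copied from the Weight row of Table 2.
estimate  <- 0.2277371
std_error <- 0.2708841

# Assumption: 9 observations after cleaning (Table 1: M = 4, F = 3, O = 2).
n_obs  <- 9
n_coef <- 2  # intercept + Weight

# t statistic and residual degrees of freedom.
t_stat   <- estimate / std_error
df_resid <- n_obs - n_coef

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

print(round(c(statistic = t_stat, df = df_resid, p.value = p_value), 4))
```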

Table 3 adds Gender as a second predictor of Height alongside Weight. Weight was still not a significant predictor (p = 0.49), and Gender showed no clear differences in height across groups (p = 0.90 for M and p = 0.81 for O, each relative to the reference category F).

Table 3: Model 2: Height ~ Weight + Gender.
term estimate std.error statistic p.value
(Intercept) 149.2726967 23.3823360 6.3839942 0.0013962
Weight 0.2623972 0.3512436 0.7470519 0.4886517
GenderM -2.1244913 15.5488953 -0.1366329 0.8966520
GenderO -4.7644739 19.0114155 -0.2506112 0.8120871

Table 4 explores Sleep and Diet as predictors of Height. The estimates carried a large amount of uncertainty: every standard error was much larger than its coefficient estimate, and every predictor had a p-value of 0.74 or higher.

Table 4: Model 3: Linear regression of Height on Sleep and Diet (Height ~ Sleep + Diet).
term estimate std.error statistic p.value
(Intercept) 179.1176471 49.719417 3.6025693 0.0227062
Sleep -1.2941176 8.644132 -0.1497105 0.8882383
Dietomnivore -11.0000000 30.865726 -0.3563824 0.7395576
Dietvegan -0.8529412 21.026159 -0.0405657 0.9695861
Dietvegetarian -3.7058824 22.319053 -0.1660412 0.8761792

When we looked at the Height distribution across Diet categories, two points stood out. First, the categories broadly overlapped. Second, the vegan and mediterranean categories had slightly higher average heights, though the differences were not statistically significant.

Figure 6: Height distribution stratified by Diet.
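The two exploratory figures named in the workflow (a boxplot of Height by Diet and a scatterplot of Weight vs Sleep) could be produced along the following lines. This is a sketch with base R graphics and made-up values; the figures in this report may have been drawn with a different plotting package, and the file names and temporary output folder are placeholders.

```r
# Made-up data for illustration only.
d <- data.frame(
  Height = c(180, 175, 178, 156, 166, 155, 165, 133, 170),
  Weight = c(80, 70, 76, 90, 110, 54, 65, 45, 55),
  Sleep  = c(7, 6, 8, 5, 6, 9, 6, 8, 6),
  Diet   = factor(c("vegan", "omnivore", "vegetarian", "mediterranean", "vegan",
                    "omnivore", "vegetarian", "mediterranean", "vegan"))
)

# Temporary folder used as a placeholder for the project's figures folder.
fig_dir <- tempdir()

# Boxplot of Height by Diet category.
box_file <- file.path(fig_dir, "height-diet-boxplot.pdf")
pdf(box_file, width = 6, height = 4)
boxplot(Height ~ Diet, data = d, xlab = "Diet", ylab = "Height (cm)")
invisible(dev.off())

# Scatterplot of Weight against Sleep.
scatter_file <- file.path(fig_dir, "weight-sleep-scatterplot.pdf")
pdf(scatter_file, width = 6, height = 4)
plot(d$Sleep, d$Weight, xlab = "Sleep (hours)", ylab = "Weight (kg)")
invisible(dev.off())
```

Writing each figure to a file inside the script, rather than exporting it by hand, keeps the figures in step with the data whenever the cleaning code changes.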

5 Discussion/Conclusion

5.1 Summary and Interpretation

In this READy exercise, we cleaned a small dataset (originally n=10; n=9 after cleaning) containing height, weight, and gender, plus two additional variables, sleep and diet. We used basic plots and three linear regression models with height as the outcome. Our analysis showed only weak relationships involving the new variables. Weight was not a significant predictor of height, and adding gender did not change that. Sleep and diet also showed no association with height.

5.2 Strengths and Limitations

A major strength of this project is the reproducibility of the workflow. All code and results are stored at relative paths and are accessible from our report.qmd.

The weaknesses lie in the small sample size and the synthetic nature of the added variables. Sleep and diet were added with no intention of creating a mock-significant result. We therefore treat the results of this study as a learning exercise and do not draw conclusions from them.

This paper (1) discusses types of analyses.

These papers (2,3) are good examples of papers published using a fully reproducible setup similar to the one shown in this template.


6 References

1. Leek JT, Peng RD. Statistics. What is the question? Science (New York, N.Y.). 2015;347(6228):1314–1315.
2. McKay B, Ebell M, Billings WZ, et al. Associations Between Relative Viral Load at Diagnosis and Influenza A Symptoms and Recovery. Open forum infectious diseases. 2020;7(11):ofaa494.
3. McKay B, Ebell M, Dale AP, et al. Virulence-mediated infectiousness and activity trade-offs and their impact on transmission potential of influenza patients. Proceedings. Biological sciences. 2020;287(1927):20200496.