READy Exercise
Author affiliations
- University of Georgia, Athens, GA, USA.
1 Abstract
This exercise demonstrates a short, reproducible workflow for cleaning a small, simple dataset and running basic statistical analyses. We were given a ‘dirty’ dataset with 3 variables (Height, Weight, Gender) and added one new numerical variable (Sleep) and one new categorical variable (Diet).
We cleaned obvious data issues (e.g., numerical data entered as text, impossible data points), saved the processed dataset as an .rds file, generated two new plots, and fit linear regression models.
Our data analysis showed no obvious relationship between weight and sleep, and height distributions overlapped across various diet categories. In the sleep and diet model, neither variable showed evidence of association with height.
This workflow was simply constructed to demonstrate a short and reproducible pipeline for analysis, and our results are limited by both sample size and the imaginary nature of our included variables.
2 Introduction
2.1 General Background Information
In total, there are 5 variables in our dataset.
Numerical variables include:
- Height, in centimeters
- Weight, in kilograms
- Sleep, as reported hours of sleep
Categorical variables include:
- Diet, reported as vegan, vegetarian, mediterranean, or omnivore
- Gender, as M(ale), F(emale), O(ther)
2.2 Description of data and data source
Two new variables, average sleep duration (Sleep) and dietary plan (Diet), were added to the data file ‘exampledata2’, a copy of the original ‘exampledata’ dataset that contained the Height, Weight, and Gender samples. I wanted to add some lifestyle factors that I have seen used in studies alongside other quantified body metrics. These new variables are synthetic and are not intended to reflect any real individuals; they serve simply as test variables for our analysis.
2.3 Questions/Hypotheses to be addressed
We sought to address three questions. First, are weight, height, and gender correlated? Second, are there any observable relationships between sleep and weight, or between diet and height? Finally, is height associated with sleep and diet when both predictors are included in a linear regression model?
3 Methods
Data used are as described in Introduction. Data can be found in ~/data/raw-data/exampledata2.xlsx or ~/data/processed-data/processeddata2.rds. Our workflow is as described below.
3.1 Schematic of workflow
- Add variables: Add Sleep (numeric) and Diet (categorical) to the copy of the raw dataset (exampledata2.xlsx).
- Clean data: Fix non-numeric height entries (e.g., “sixty” to 60), fix improper units and convert 6 ft → cm, and remove the extreme weight outlier (7000). Save the cleaned data as processeddata2.rds, a copy of ‘processeddata.rds’.
- Exploratory analysis: Create and save two figures: a boxplot of Height by Diet and a scatterplot of Weight vs Sleep.
- Statistical analysis: Fit linear models and save the result tables as files.
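The exploratory step in this schematic could be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the project's actual plotting script: the data frame `dat`, its values, the output folder, and the file names are all hypothetical, and base graphics with the pdf() device are used only so the sketch runs without extra packages.

```r
# Hypothetical toy data; these values are NOT the real dataset.
dat <- data.frame(
  Height = c(180, 166, 155, 178, 170, 160),
  Weight = c(80, 70, 55, 90, 65, 50),
  Sleep  = c(6, 8, 5, 7, 9, 6),
  Diet   = factor(c("vegan", "omnivore", "vegetarian",
                    "mediterranean", "omnivore", "vegan"))
)

# Hypothetical output folder; a temporary directory keeps the sketch side-effect free.
fig_dir <- tempfile("figures")
dir.create(fig_dir)

# Figure 1: boxplot of Height by Diet.
box_path <- file.path(fig_dir, "height-diet-boxplot.pdf")
pdf(box_path)
boxplot(Height ~ Diet, data = dat, xlab = "Diet", ylab = "Height (cm)")
invisible(dev.off())

# Figure 2: scatterplot of Weight vs Sleep.
scatter_path <- file.path(fig_dir, "weight-sleep-scatter.pdf")
pdf(scatter_path)
plot(Weight ~ Sleep, data = dat, xlab = "Sleep (hours)", ylab = "Weight (kg)")
invisible(dev.off())
```

Writing each figure to its own file keeps the report and the figure-generating code separate, so the figures can be regenerated without re-rendering the report.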
3.2 Data acquisition
Data were retrieved from the Andreas Handel Research Group GitHub repository at UGA (ahgroup/data-analysis-template), in the ~/data/raw-data folder. New variables were added and are synthetic.
Data import and cleaning
Write code that reads in the file and cleans it so it’s ready for analysis. Since this will be fairly long code for most datasets, it might be a good idea to have it in one or several R scripts. If that is the case, explain here briefly what kind of cleaning/processing you do, and provide more details and well documented code somewhere (e.g. as supplement in a paper). All materials, including files that contain code, should be commented well so everyone can follow along.
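A minimal sketch of the cleaning steps described in the workflow (text-to-number height entries, a feet-to-centimeters conversion, and removal of the impossible weight) might look like the following. The toy rows, the rule that heights below 10 were recorded in feet, the 1000 kg weight cutoff, and the temporary output path are all assumptions made for illustration; the project's own cleaning script may differ.

```r
# Hypothetical toy rows that mimic the kinds of problems described; not the real file.
raw <- data.frame(
  Height = c("180", "sixty", "6", "166"),
  Weight = c(80, 55, 70, 7000),
  Gender = c("M", "F", "O", "F"),
  stringsAsFactors = FALSE
)

clean <- raw

# Replace the spelled-out entry with its numeric value, then convert the column.
clean$Height[clean$Height == "sixty"] <- "60"
clean$Height <- as.numeric(clean$Height)

# Assumption: a height below 10 was recorded in feet; convert it to centimeters.
in_feet <- clean$Height < 10
clean$Height[in_feet] <- clean$Height[in_feet] * 30.48

# Drop the impossible weight (the 1000 kg cutoff is an illustrative choice).
clean <- clean[clean$Weight < 1000, ]

# Store the categorical variable as a factor.
clean$Gender <- factor(clean$Gender)

# Save the processed data; a temporary file stands in for the project path.
out_path <- tempfile(fileext = ".rds")
saveRDS(clean, out_path)
```

Saving the result as .rds preserves column types such as factors, which a plain-text export would lose.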
3.3 Statistical analysis
We fit three linear regression models using Height as the outcome variable. Models were fit with the lm() function in R (RStudio version 2025.05.1+513), and model coefficients were saved as RDS tables using broom::tidy() for reproducible reporting. The models are shown below:
- Model 1: Height ~ Weight
- Model 2: Height ~ Weight + Gender
- Model 3: Height ~ Sleep + Diet
Gender and Diet were treated as categorical predictors, and Weight and Sleep were treated as numeric predictors. Model output tables were saved to the results/tables/ folder.
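The three model specifications can be sketched as below. The data frame is hypothetical toy data, and the helper `coef_table()` is an illustrative base-R stand-in for broom::tidy() (it returns the same five columns) so that the sketch runs without extra packages. The output folder and file names are also assumptions.

```r
# Hypothetical toy data with the report's five variables; NOT the real dataset.
dat <- data.frame(
  Height = c(180, 166, 155, 178, 170, 160, 175, 158, 183, 168),
  Weight = c(80, 70, 55, 90, 65, 50, 85, 60, 95, 72),
  Gender = factor(c("M", "F", "F", "M", "O", "F", "M", "O", "M", "F")),
  Sleep  = c(6, 8, 5, 7, 9, 6, 7, 8, 5, 6),
  Diet   = factor(c("vegan", "omnivore", "vegetarian", "mediterranean", "omnivore",
                    "vegan", "mediterranean", "vegetarian", "omnivore", "vegan"))
)

# The three models listed above, all with Height as the outcome.
fits <- list(
  model1 = lm(Height ~ Weight, data = dat),
  model2 = lm(Height ~ Weight + Gender, data = dat),
  model3 = lm(Height ~ Sleep + Diet, data = dat)
)

# Illustrative helper: coefficient table with broom::tidy()-style column names.
coef_table <- function(fit) {
  cf <- summary(fit)$coefficients
  data.frame(
    term      = rownames(cf),
    estimate  = cf[, 1],
    std.error = cf[, 2],
    statistic = cf[, 3],
    p.value   = cf[, 4],
    row.names = NULL
  )
}

tables <- lapply(fits, coef_table)

# Save one RDS table per model; a temporary folder stands in for results/tables/.
out_dir <- tempfile("tables")
dir.create(out_dir)
out_files <- file.path(out_dir, paste0(names(tables), ".rds"))
for (i in seq_along(tables)) saveRDS(tables[[i]], out_files[i])
```

Because Gender and Diet are factors, lm() expands each into indicator terms automatically, which is why the categorical predictors appear as several rows in the coefficient tables.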
4 Results
4.1 Exploratory/Descriptive analysis
Table 1 shows a summary of the data.
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | factor.ordered | factor.n_unique | factor.top_counts | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | Diet | 0 | 1 | 5 | 13 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | Gender | 0 | 1 | NA | NA | NA | NA | NA | FALSE | 3 | M: 4, F: 3, O: 2 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | Height | 0 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | 165.666667 | 15.976545 | 133 | 156 | 166 | 178 | 183 | ▂▁▃▃▇ |
| numeric | Weight | 0 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | 70.111111 | 21.245261 | 45 | 55 | 70 | 80 | 110 | ▇▂▃▂▂ |
| numeric | Sleep | 0 | 1 | NA | NA | NA | NA | NA | NA | NA | NA | 6.777778 | 1.481366 | 5 | 6 | 6 | 8 | 9 | ▅▇▁▇▂ |
4.2 Full Analysis
To get some further insight into your data, if reasonable you could compute simple statistics (e.g. simple models with 1 predictor) to look for associations between your outcome(s) and each individual predictor variable. Though note that unless you pre-specified the outcome and main exposure, any “p<0.05 means statistical significance” interpretation is not valid.
Table 2 shows the association between Height and Weight; weight was not strongly associated with height (estimate = 0.23, p-value = 0.43).
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 149.6997661 | 19.7518528 | 7.5790240 | 0.0001285 |
| Weight | 0.2277371 | 0.2708841 | 0.8407177 | 0.4282860 |
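As a worked check on how to read Table 2, the sketch below recomputes the test statistic and p-value for Weight from the reported estimate and standard error, and evaluates the fitted line at the mean weight from Table 1. The sample size of 9 is an assumption inferred from the Gender counts in Table 1 (4 + 3 + 2); the variable names are illustrative.

```r
# Values copied from Table 2 (Height ~ Weight).
intercept <- 149.6997661
slope     <- 0.2277371
slope_se  <- 0.2708841

# Assumption: 9 observations, inferred from the Gender counts in Table 1.
n <- 9
df_resid <- n - 2  # two estimated coefficients (intercept and slope)

# The t statistic is the estimate divided by its standard error.
t_stat <- slope / slope_se

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# Predicted height (cm) at the mean weight reported in Table 1.
mean_weight <- 70.111111
pred_height <- intercept + slope * mean_weight

print(round(c(statistic = t_stat, p.value = p_value, predicted = pred_height), 4))
```

The slope means that each additional kilogram of weight is associated with roughly 0.23 cm of additional height in this sample. The prediction at the mean weight equals the mean height, as expected for a least-squares line with an intercept.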
Table 3 adds Gender as a second predictor of Height alongside Weight. Weight was still not a significant predictor (p-value = 0.49), and height did not differ clearly across gender groups.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 149.2726967 | 23.3823360 | 6.3839942 | 0.0013962 |
| Weight | 0.2623972 | 0.3512436 | 0.7470519 | 0.4886517 |
| GenderM | -2.1244913 | 15.5488953 | -0.1366329 | 0.8966520 |
| GenderO | -4.7644739 | 19.0114155 | -0.2506112 | 0.8120871 |
Table 4 explores Sleep and Diet as predictors of Height. However, the estimates carried a large amount of uncertainty (p-value = 0.81), and every Sleep and Diet coefficient in the table has a p-value above 0.7.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 179.1176471 | 49.719417 | 3.6025693 | 0.0227062 |
| Sleep | -1.2941176 | 8.644132 | -0.1497105 | 0.8882383 |
| Dietomnivore | -11.0000000 | 30.865726 | -0.3563824 | 0.7395576 |
| Dietvegan | -0.8529412 | 21.026159 | -0.0405657 | 0.9695861 |
| Dietvegetarian | -3.7058824 | 22.319053 | -0.1660412 | 0.8761792 |
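The term names in Table 4 (Dietomnivore, Dietvegan, Dietvegetarian) indicate that mediterranean serves as the reference category, so each Diet estimate is a difference from the mediterranean group at a given value of Sleep. The sketch below illustrates how R's default coding produces this, assuming default treatment contrasts were used. The `relevel()` call shows a hypothetical alternative baseline and is not something the report did.

```r
# The four diet labels named in this report.
diet <- c("vegan", "vegetarian", "mediterranean", "omnivore")

# factor() sorts levels alphabetically, so the first level becomes the baseline.
diet_factor <- factor(diet)
baseline <- levels(diet_factor)[1]

# Indicator columns that a model formula creates for the non-baseline levels.
dummy_cols <- colnames(model.matrix(~ Diet, data = data.frame(Diet = diet_factor)))

# Hypothetical alternative: make omnivore the baseline instead.
diet_omni <- relevel(diet_factor, ref = "omnivore")

print(baseline)
print(dummy_cols)
print(levels(diet_omni))
```

Changing the baseline changes which comparisons the coefficients express, but not the fitted values of the model.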
When we looked at the distribution of Height across Diet categories, a few points stood out. First, the categories overlapped broadly. Second, the vegan and mediterranean categories had higher average heights, though not significantly so.
5 Discussion/Conclusion
5.1 Summary and Interpretation
In this READy exercise, we cleaned a small dataset (originally n=10) containing height, weight, and gender, plus two added variables, sleep and diet. We explored the data with basic plots and fit three linear regression models with height as the outcome. Our analysis showed only weak relationships involving the new variables. Weight was not a significant predictor of height, and adding gender did not change that. Sleep and diet also showed no association with height.
5.2 Strengths and Limitations
A great strength of this project is the reproducibility of the workflow. All code and results are stored under relative paths accessible from our report.qmd.
The weaknesses lie in our small dataset and the imaginary data. Sleep and diet were added with no intention of creating a mock-significant result. Therefore, the results of this study should be treated as a learning exercise, and no conclusions should be drawn from them.
This paper (1) discusses types of analyses.
These papers (2,3) are good examples of papers published using a fully reproducible setup similar to the one shown in this template.