Diet & Disease: A dive into Power BI and R-studio
A while back a friend of mine introduced me to a book on the effects of diet on health. His interest in the subject stemmed from
the recent loss of his two brothers, both in their fifties, due to suspected heart attacks. The book was “The China Study”
by T. Colin Campbell.
Wikipedia describes it as
The Most Comprehensive Study of Nutrition Ever Conducted
and the Startling Implications for Diet, Weight Loss and Long-term Health.
I have no background in epidemiology or advanced statistics, but as I was looking for case data to trial the data visualization
and manipulation functionalities of Microsoft’s Power BI product, I decided to see if I could find the source data for this study.
I was in luck, and am most grateful to Dr Campbell and his team for taking the decision to publish the raw data.
Whilst I found the arguments in the book compelling, I was initially surprised by what I saw in the data!
I was however not the first person to ‘spot’ seemingly contradictory evidence in the data.
I then discovered both meanings of the word ‘confound’.
I had just experienced its meaning as an exclamation: oh, damn it!
…and the reason for that was the same word's statistical meaning!
A confounder is a variable that influences both the dependent variable
and independent variable, causing a spurious association.
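To make the definition concrete, here is a minimal sketch on purely synthetic data (nothing here comes from the China Study files). It assumes a single hidden confounder that drives both variables, and the helper names `pearson` and `residuals` are my own. The two variables never influence each other, yet they correlate until the confounder is removed.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def residuals(v, z):
    """What is left of v after subtracting a least-squares straight line in z."""
    n = len(v)
    mean_v, mean_z = sum(v) / n, sum(z) / n
    slope = (sum((zi - mean_z) * (vi - mean_v) for zi, vi in zip(z, v))
             / sum((zi - mean_z) ** 2 for zi in z))
    return [vi - mean_v - slope * (zi - mean_z) for vi, zi in zip(v, z)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
confounder = [rng.gauss(0, 1) for _ in range(2000)]
# Both variables depend on the confounder, but not on each other.
independent = [c + rng.gauss(0, 1) for c in confounder]
dependent = [c + rng.gauss(0, 1) for c in confounder]

raw = pearson(independent, dependent)
adjusted = pearson(residuals(independent, confounder),
                   residuals(dependent, confounder))
print(f"raw correlation:           {raw:.2f}")
print(f"with confounder removed:   {adjusted:.2f}")
```

The raw figure should come out clearly positive and the adjusted one close to zero, which is the 'spurious association' in miniature.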
For the latter discovery I am again grateful to the good doctor for having responded to someone else’s
first-off observations.
Herewith follows a brief overview of some of the powers of Power BI to illustrate the dangers of confounding.
The example I have used largely follows the response of Dr Campbell but with some added variables that presented themselves in the data.
At the end of this page I also present some of the technical aspects and acknowledgements in putting this presentation together.
Univariate correlations
The China Study (book) essentially concludes that people who eat a predominantly whole-food, vegan diet—avoiding animal products as a main source of nutrition,
including beef, pork, poultry, fish, eggs, cheese, and milk, and reducing their intake of processed foods and refined carbohydrates—will escape, reduce, or reverse
the development of numerous diseases.
To investigate this claim I used only the 1989 data contained
in:
- CH89M.CSV (1986-8 mainland mortality variables)
- CH89DG.CSV (1989 mainland diet and geographic variables)
- CH89Q.CSV (1989 mainland questionnaire variables)
I also simplified the analysis by ignoring any gender-related diseases, as the ‘questionnaire’ data didn’t cater for this field. In order for correlations to stand out, I developed reports that plot ‘Disease Mortalities’ vs ‘Diet’ and split these into two visuals, i.e. ‘Whole food vegan diet’ vs ‘Animal based diet’.
I excluded milk from the results for the same reason as mentioned in Dr Campbell’s response, i.e. limited data. I also only considered the ‘adult’ age group, AGE 35-69.
In the correlation tables below, ‘green’ is good for you and ‘red’ is bad, i.e. red spheres represent a positive correlation between disease mortality and a foodstuff. The size of the sphere indicates the strength of the correlation (between -1 and 1), and the opacity represents the ‘statistical significance’ of the correlation. A good guide to interpreting correlations and significance (p-values) is presented here. A p-value of 0.05 means that, if there were truly no association, a correlation at least this strong would turn up by chance in only about 5% of samples. Strictly speaking this is not the probability that a repeat survey would give the same result, but for display purposes I show 1 - p (e.g. 95%) as the significance, so that more opaque spheres are more significant (the tool-tip shows the original p-value).
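One way to get a feel for what a p-value measures is a permutation test: shuffle one variable many times and count how often chance alone produces a rank correlation at least as strong as the observed one. The sketch below runs on synthetic data with helper names of my own; it is not how the p-values in the reports were calculated, and its simplified ranking does not average ties.

```python
import math
import random


def ranks(values):
    """Rank positions 1..n (ties are not averaged in this simplified version)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Share of shuffles whose |Spearman correlation| matches or beats the observed one."""
    rng = random.Random(seed)
    rank_x, rank_y = ranks(x), ranks(y)
    observed = abs(pearson(rank_x, rank_y))  # Spearman = Pearson on the ranks
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(rank_y)  # destroys any real link between x and y
        if abs(pearson(rank_x, rank_y)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(60)]          # 60 made-up observations
related = [xi + rng.gauss(0, 0.5) for xi in x]    # genuinely linked to x
unrelated = [rng.gauss(0, 1) for _ in x]          # no link to x

p_related = permutation_p_value(x, related)
p_unrelated = permutation_p_value(x, unrelated)
print(f"related pair:   p = {p_related:.4f}")
print(f"unrelated pair: p = {p_unrelated:.4f}")
```

A small p-value says the observed correlation would be rare if the two variables were unrelated; it says nothing about whether a third variable is responsible.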
Whole food vegan diet vs Disease mortalities
Of interest above is the unexpected positive (i.e. adverse) correlation of 0.43 (p-value 0.0004) between
ISCHAEMIC HEART DISEASE AGE 35-69 (stand. rate/100,000) (ICD9 410-4) and WHEAT FLOUR INTAKE (g/day/reference man, air-dry basis).
This is an example of ‘confounding’, which I also explore later.
Some of the strongest and most significant correlations occur against NASOPHARYNGEAL CANCER AGE 35-69 (stand. rate/100,000) (ICD9 147).
I explore this further with multiple linear regression later.
Animal based diet vs Disease mortalities
Below I also show the questionnaire responses vs disease mortalities, to highlight ‘lifestyle’ influences such as income. Unfortunately the questionnaire data covers a smaller data set than the diet and mortality datasets, so I have chosen to exclude its variables from my multi-dimensional analysis.
Questionnaire vs Disease mortalities
Partial Correlation Analysis to highlight Confounding
In the 1st of the following embedded Power BI reports, you can clearly see the 'strong' and 'significant' (0.43, 0.0004) correlation between Heart disease and Wheat Flour consumption.
I also show a histogram of both variables, which shows that the data does not appear to be normally distributed.
For this reason I have used the Spearman
correlation (as opposed to the more commonly used Pearson correlation).
The following 2 reports (pg 2 & 3) show the correlation of the 6 confounders with the 'dependent' variable and their correlation with the 'independent'
variable, Wheat flour. For a variable to be a confounder, it should exhibit these 3-way relationships.
Wheat Flour vs Heart Disease and Linear Multi Regression of Nasopharynx Cancer
Dr Campbell, in his response, points out that the confounding effects of monounsaturated fatty acid,
BMI and green vegetable intake could well be influencing this result. I found a few more: income, wine and rice.
The correlation/significance of BMI against heart disease does not seem to warrant its inclusion in a partial correlation assessment.
In a partial correlation assessment, the effects of the confounders are seen by including them in the (R-studio) formula, using pcor.test from the ppcor package…
pcor.test(Dependent variable, Independent variable, list(Confounder 1, 2, 3, …, n), method='spearman')
I’ve also ignored income, as it comes from a smaller data set, and instead used rice as a ‘proxy’ for income.
From the table below we can see how the 'strong and statistically significant' correlation breaks down to being meaningless as each confounder is added.
This seems to prove the dangers of making hasty conclusions based on univariate analysis!
Run | Dependent variable | Independent variable | Confounder 1 | Confounder 2 | Confounder 3 | Confounder 4 | Spearman partial correlation coef | p-value |
---|---|---|---|---|---|---|---|---|
1 | ISCHAEMIC.HEART.. | WHEAT.FLOUR... | | | | | 0.430 | 0.0004 |
2 | ISCHAEMIC.HEART.. | WHEAT.FLOUR... | MONOUNSATURATED.FATTY.. | | | | 0.303 | 0.0138 |
3 | ISCHAEMIC.HEART.. | WHEAT.FLOUR... | MONOUNSATURATED.FATTY.. | GREEN.VEGETABLE.. | | | 0.274 | 0.0287 |
4 | ISCHAEMIC.HEART.. | WHEAT.FLOUR... | MONOUNSATURATED.FATTY.. | GREEN.VEGETABLE.. | WINE.INTAKE | | 0.227 | 0.0732 |
5 | ISCHAEMIC.HEART.. | WHEAT.FLOUR... | MONOUNSATURATED.FATTY.. | GREEN.VEGETABLE.. | WINE.INTAKE | RICE.INTAKE | 0.046 | 0.7197 |
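The way a correlation shrinks as confounders are added can be sketched in Python as well. This is an illustration on synthetic data, not a port of the R run above: it ranks each series and applies the standard recursive partial-correlation formula, it does not compute p-values, and all function and variable names are my own.

```python
import math
import random


def ranks(values):
    """Rank positions 1..n (ties are not averaged in this simplified version)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, controls):
    """Correlation of x and y with the control series held fixed (recursive formula)."""
    if not controls:
        return pearson(x, y)
    z, rest = controls[-1], controls[:-1]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def spearman_partial(x, y, controls):
    """Spearman version: rank every series first, then take the partial correlation."""
    return partial_corr(ranks(x), ranks(y), [ranks(c) for c in controls])


rng = random.Random(7)
n = 1000  # made-up sample size
z1 = [rng.gauss(0, 1) for _ in range(n)]  # first synthetic confounder
z2 = [rng.gauss(0, 1) for _ in range(n)]  # second synthetic confounder
x = [a + b + rng.gauss(0, 1) for a, b in zip(z1, z2)]
y = [a + b + rng.gauss(0, 1) for a, b in zip(z1, z2)]

r0 = spearman_partial(x, y, [])
r1 = spearman_partial(x, y, [z1])
r2 = spearman_partial(x, y, [z1, z2])
for label, r in [("no controls", r0), ("control z1", r1), ("control z1, z2", r2)]:
    print(f"{label:15s} {r:6.3f}")
```

Each added control strips out part of the shared variation, so the coefficient falls step by step, the same pattern as in the table.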
Multi-dimensional Regression Analysis
In the second part of the report pack (pg 4 of 6) I wanted to visualize the results of a multi-dimensional regression.
From the univariate correlation reports we see strong and significant relationships between
NASOPHARYNGEAL CANCER AGE 35-69 (stand. rate/100,000) (ICD9 147) and the following (correlation, p-value):
- PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN (-0.75, 5.21E-13)
- PERCENTAGE.OF.CALORIC.INTAKE.FROM.FAT (0.39, 0.0014)
- FISH.INTAKE (0.57 , 5.59E-7)
- MEAT INTAKE (red meat and poultry) (0.45, 0.0001)
Fitting a linear model in R-studio with these four variables as predictors results in a very significant p-value of 2.29E-6 for the F-statistic, implying that at least one of the predictor variables is significantly related to the outcome variable. The t-value of each predictor indicates whether or not there is a significant association between that predictor and the outcome variable. The predictor with the t-value closest to zero can be removed and the model re-run, one predictor at a time.
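The elimination procedure just described can be sketched with NumPy and SciPy. This is a rough stand-in for the R-studio workflow on synthetic data, not the original analysis. The predictor labels, the 65 made-up rows and the helper names `fit_ols` and `backward_steps` are all assumptions of mine, and the loop simply runs until no predictors are left rather than stopping at the best model.

```python
import numpy as np
from scipy import stats


def fit_ols(X, y):
    """Least-squares fit with intercept: coefficients, t-values, F-test p-value."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])  # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - k - 1
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    std_err = np.sqrt(np.diag(ss_res / dof * np.linalg.inv(A.T @ A)))
    t_values = beta / std_err
    f_stat = ((ss_tot - ss_res) / k) / (ss_res / dof)
    return beta, t_values, float(stats.f.sf(f_stat, k, dof))


def backward_steps(X, y, names):
    """Refit repeatedly, each time dropping the predictor whose t-value is nearest zero."""
    keep = list(range(X.shape[1]))
    steps = []
    while keep:
        _, t_values, f_p = fit_ols(X[:, keep], y)
        steps.append(([names[i] for i in keep], f_p))
        weakest = int(np.argmin(np.abs(t_values[1:])))  # index 0 is the intercept
        keep.pop(weakest)
    return steps


rng = np.random.default_rng(0)
names = ["plant_protein", "fat", "fish", "meat"]  # hypothetical labels
X = rng.normal(size=(65, 4))                      # synthetic predictors
# Only two of the four predictors have a real effect on the outcome.
y = 30 - 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=65)

steps = backward_steps(X, y, names)
for kept, f_p in steps:
    print(f"{', '.join(kept):35s} F-test p-value = {f_p:.3g}")
```

Comparing the F-test p-value from one step to the next is what the runs below do by hand.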
1st run
Variables | Coefficients | t value |
---|---|---|
Intercept | 20.71577 | 2.791 |
PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN.. | -2.06428 | -3.233 |
PERCENTAGE.OF.CALORIC.INTAKE.FROM.FAT.. | 0.21831 | 1.022 |
FISH.INTAKE | 0.04598 | 1.676 |
MEAT.INTAKE | -0.01685 | -0.351 |
Removing meat intake (the predictor with the t-value closest to zero), the p-value of the F-statistic improves to 5.882e-07, with the new t-values as follows:
2nd run
Variables | Coefficients | t value |
---|---|---|
Intercept | 20.29951 | 2.791 |
PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN.. | -1.99310 | -3.316 |
PERCENTAGE.OF.CALORIC.INTAKE.FROM.FAT.. | 0.17801 | 0.995 |
FISH.INTAKE | 0.04534 | 1.668 |
Removing fat improves it further, to a p-value of 1.83e-07.
3rd run
Variables | Coefficients | t value |
---|---|---|
Intercept | 25.36 | 4.878 |
PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN.. | -2.23040 | -4.043 |
FISH.INTAKE | 0.05507 | 2.171 |
Removing fish then weakens the model slightly, to a p-value of 2.375e-07.
4th run
Variables | Coefficients | t value |
---|---|---|
Intercept | 31.8660 | 7.290 |
PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN.. | -2.8342 | -5.783 |
Thus the best linear model of the 4 initially selected variables need only include fish intake and the percentage of caloric intake from plant protein (the 3rd run) for the ‘strongest’ fit.
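The linear model of the 3rd run, i.e. predicted mortality ≈ 25.36 - 2.2304 × (% caloric intake from plant protein) + 0.05507 × (fish intake),
is shown for 3 typical fish intakes on pg 4.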
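As a worked example, the coefficients from the 3rd-run table above can be plugged into a small function. The three fish intakes and the plant-protein percentage used here are values I made up for illustration (the ones used in the report are not stated), and I am assuming the outcome is the nasopharyngeal cancer mortality rate per 100,000.

```python
# Coefficients copied from the 3rd-run table.
INTERCEPT = 25.36
COEF_PLANT_PROTEIN = -2.23040  # per percentage point of calories from plant protein
COEF_FISH = 0.05507            # per unit of fish intake


def predicted_rate(plant_protein_pct, fish_intake):
    """Mortality rate predicted by the 3rd-run linear model."""
    return INTERCEPT + COEF_PLANT_PROTEIN * plant_protein_pct + COEF_FISH * fish_intake


# Hypothetical inputs: 10% of calories from plant protein, three fish intakes.
for fish in (0, 25, 50):
    print(f"fish intake {fish:3d}: predicted rate {predicted_rate(10, fish):.2f}")
```

Because the model is linear, each extra unit of fish shifts the prediction by the same 0.05507, whatever the plant-protein level.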
As an aside, this model may be misleading, as a quick Google search seems to indicate that the real correlation may be between salted fish and nasopharyngeal cancer!
Finally, the geospatial capabilities of Power BI's standard map visual are shown on pg 5, as well as a choropleth using a Mapbox visual add-in on pg 6. On the choropleth map, use the mouse wheel to zoom; click and hold the right mouse button and move the mouse to swivel and tilt; and click and hold the left mouse button and move the mouse to pan the map.
Technologies utilised in this page
The starting point was the downloaded csv files. Using Power BI Query Editor, the tables were linked on a composite field
of 'C_S_X_D' (County, Sex, Xiang, Date, i.e. '83 or '89). Separate Correlation-Coefficient tables were constructed using a
Python script to compare X and Y values of fields in the raw data using the function
scipy.stats.spearmanr.
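A minimal sketch of such a script is shown below. It is not the original script: the column names and values are invented, `correlation_table` is a helper name of my own, and dropping a row whenever either value is missing is my assumption about how gaps would be handled.

```python
from itertools import product

from scipy.stats import spearmanr


def correlation_table(columns, x_names, y_names):
    """One row per (x, y) pair with Spearman's rho, its p-value and the sample size."""
    rows = []
    for x_name, y_name in product(x_names, y_names):
        # Keep only the rows where both values are present.
        pairs = [(a, b) for a, b in zip(columns[x_name], columns[y_name])
                 if a is not None and b is not None]
        xs, ys = zip(*pairs)
        rho, p_value = spearmanr(xs, ys)
        rows.append({"x": x_name, "y": y_name, "rho": float(rho),
                     "p_value": float(p_value), "n": len(pairs)})
    return rows


# Invented example data; None marks a missing value.
columns = {
    "diet_a": [10, 20, 30, 40, 50, 60],
    "diet_b": [6, 5, None, 3, 2, 1],
    "mortality": [1.0, 2.5, 2.7, 4.0, 9.0, 20.0],
}

table = correlation_table(columns, ["diet_a", "diet_b"], ["mortality"])
for row in table:
    print(row)
```

The resulting list of rows has the long 'one row per pair' shape that is convenient to load into Power BI as a separate table.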
For a nice explanation of partial correlations in R, see here.
The correlation charts were developed using a 'bubble matrix'. This consists of a matrix table with the two variables assigned to a row and a column.
The values are SVG objects (circles with a radius based on the correlation coefficient). The basic technique is detailed
here.
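To give an idea of what one cell of such a matrix contains, here is a sketch that builds the SVG markup for a single bubble. The function name, the 20-pixel maximum radius and the exact colour and opacity rules are assumptions of mine based on the description earlier on this page (red for positive, green for negative, opacity as 1 - p); the reports themselves may well build the markup differently, for example in a Power BI measure.

```python
def bubble_svg(rho, p_value, max_radius=20):
    """SVG markup for one correlation bubble (size, colour and opacity rules are assumed)."""
    radius = max_radius * min(abs(rho), 1.0)       # stronger correlation, bigger circle
    colour = "red" if rho > 0 else "green"         # red = positive correlation with mortality
    opacity = max(0.0, min(1.0, 1.0 - p_value))    # more significant, more opaque
    size = 2 * max_radius
    return (
        f"<svg xmlns='http://www.w3.org/2000/svg' width='{size}' height='{size}'>"
        f"<circle cx='{max_radius}' cy='{max_radius}' r='{radius:.1f}' "
        f"fill='{colour}' fill-opacity='{opacity:.3f}'/></svg>"
    )


# The wheat flour vs heart disease figures quoted on this page.
print(bubble_svg(0.43, 0.0004))
# The plant protein figures quoted on this page.
print(bubble_svg(-0.75, 5.21e-13))
```

Markup like this is typically exposed to a matrix visual as an image value, so each row and column intersection renders its own circle.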
The multiple linear regressions (MLR) were done directly in R-studio after uploading datasets used in calculating the partial correlations.
A useful explanation of the technique can be found here.
A helpful article on interpreting MLRs and fine-tuning a predictive model is given here.
I received some helpful advice on creating choropleths in Power BI via Mapbox from Selina,
and I am also most grateful to GeoData at UC Berkeley Library for the shape file of China's provinces.
Disclaimer
While I have taken great care to present this analysis as accurately as possible, it has not been reviewed by other third parties and as such may contain errors. I receive no financial compensation from any parties associated with the subject matter.