Diet & Disease: A dive into Power BI and R-studio

A while back a friend of mine introduced me to a book on the effects of diet on health. His interest in the subject stemmed from the recent loss of his two brothers, both in their fifties, due to suspected heart attacks. The book was “The China Study” by T. Colin Campbell. Wikipedia describes it as “The Most Comprehensive Study of Nutrition Ever Conducted and the Startling Implications for Diet, Weight Loss and Long-term Health”. I have no background in epidemiology or advanced statistics, but as I was looking for case data to trial the data visualization and manipulation functionality of Microsoft’s Power BI product, I decided to see if I could find the source data for this study. I was in luck, and I am most grateful to Dr Campbell and his team for taking the decision to publish the raw data. Whilst I found the arguments in the book compelling, I was initially surprised by what I saw in the data! I was, however, not the first person to ‘spot’ seemingly contradictory evidence in the data.
I then discovered both meanings of the word ‘confound’.
I had just experienced its meaning as an exclamation: oh damn it!
…and the reason for that was the same word's statistical meaning! A confounder is a variable that influences both the dependent variable and the independent variable, causing a spurious association between them.
For the latter discovery I am again grateful to the good doctor for having responded to someone else’s initial observations.


Herewith follows a brief overview of some of the powers of Power BI to illustrate the dangers of confounding. The example I have used largely follows the response of Dr Campbell but with some added variables that presented themselves in the data.
At the end of this page I also present some of the technical aspects of, and acknowledgements for, putting this presentation together.

Univariate correlations

The China Study (book) essentially concludes that people who eat a predominantly whole-food, vegan diet—avoiding animal products as a main source of nutrition, including beef, pork, poultry, fish, eggs, cheese, and milk, and reducing their intake of processed foods and refined carbohydrates—will escape, reduce, or reverse the development of numerous diseases.
To investigate this claim I used only the 1989 data contained in:

  • CH89M.CSV (1986-8 mainland mortality variables)
  • CH89DG.CSV (1989 mainland diet and geographic variables)
  • CH89Q.CSV (1989 mainland questionnaire variables)

I also simplified the analysis by ignoring any gender-related diseases, as the ‘questionnaire’ data didn’t cater for this field. In order for correlations to stand out, I developed reports that plot ‘Disease Mortalities’ vs ‘Diet’ and split these into two visuals, i.e. ‘Whole food vegan diet’ vs ‘Animal based diet’.
I excluded milk from the results for the same reason as mentioned in Dr Campbell’s response i.e. limited data. I also only considered the ‘adult’ age group of AGE 35-69.
In the correlation tables below, ‘green’ is good for you and ‘red’ is bad, i.e. red spheres represent a positive correlation between disease mortality and a foodstuff. The size of the sphere indicates the strength of the correlation (between -1 and 1), and the opacity represents the ‘statistical significance’ of the correlation. A good guide to interpreting correlations and significance (p-values) is presented here. A p-value of 0.05 means that, if there were really no association, a correlation at least as strong as the one observed would turn up by chance in only about 5% of samples. For the opacity I show this as 1 minus the p-value (95% in this case); this is a display convenience and not the probability that a repeat survey would give the same result (the tool-tip shows the original p-value).
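To make the meaning of the p-value concrete, here is a minimal sketch in plain Python (not the script used for this page). It computes a Spearman correlation by ranking both series, then estimates a p-value by shuffling one series many times and counting how often chance alone produces a correlation at least as strong. The two data series and the function names are invented for illustration; the scipy.stats.spearmanr function mentioned later derives its p-value analytically rather than by shuffling.

```python
import random


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Share of random shuffles whose |rho| is at least the observed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


# Invented example data: ten counties, intake vs mortality rate.
intake = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mortality = [2, 3, 5, 8, 9, 12, 15, 16, 20, 22]

rho = spearman(intake, mortality)
p = permutation_p_value(intake, mortality)
print(f"rho = {rho:.3f}, permutation p-value = {p:.4f}, opacity = {1 - p:.4f}")
```

The last printed figure is the 1 minus p-value quantity used for the opacity of the spheres.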



Whole food vegan diet vs Disease mortalities



Of interest above is the unexpected positive correlation between ISCHAEMIC HEART DISEASE AGE 35-69 (stand. rate/100,000) (ICD9 410-4) and WHEAT FLOUR INTAKE (g/day/reference man, air-dry basis) of 0.43 (p-value 0.0004).
This is an example of ‘confounding’, which I also explore later.
Some of the strongest and most significant correlations occur against NASOPHARYNGEAL CANCER AGE 35-69 (stand. rate/100,000) (ICD9 147). I explore this further with multiple linear regression later.

Animal based diet vs Disease mortalities





Below I also show the results of the questionnaire responses vs disease mortalities to highlight ‘lifestyle’ influences such as income. Unfortunately this questionnaire data covers a smaller data set than the diet and mortality datasets, so I have chosen to exclude its variables from my multi-dimensional analysis.


Questionnaire vs Disease mortalities



Partial Correlation Analysis to highlight Confounding

In the 1st of the following embedded Power BI reports, you can clearly see the 'strong' and 'significant' (0.43, 0.0004) correlation between Heart disease and Wheat Flour consumption. I also show a histogram of both variables, which shows that the data does not appear to be normally distributed. For this reason I have used the Spearman correlation (as opposed to the more commonly used Pearson correlation).
The following 2 reports (pg 2 & 3) show the correlation of the 6 confounders with the 'dependent' variable and their correlation with the 'independent' variable, Wheat flour. For a variable to be a confounder, it should exhibit these 3-way relationships, i.e. be correlated with both the dependent and the independent variable.


Wheat Flour vs Heart Disease and Multiple Linear Regression of Nasopharynx Cancer



Dr Campbell, in his response, points out that the confounding effects of monounsaturated fatty acid, BMI and green vegetable intake could well be influencing this result. I found a few more: income, wine and rice.
The correlation/significance of BMI against Heart disease does not seem to warrant its inclusion in a partial correlation assessment.
In a partial correlation assessment, the effects of the confounders are accounted for by including them in the (R-studio) formula:

pcor.test(Dependent variable, Independent variable, list(Confounder 1, Confounder 2, ..., Confounder n), method = 'spearman')

I’ve also ignored income, as it comes from a smaller data set, and instead used rice as a ‘proxy’ for income.
From the table below we can see how the 'strong and statistically significant' correlation weakens as each confounder is added, until it is no longer statistically significant.
This illustrates the dangers of drawing hasty conclusions from univariate analysis!


Run  Dependent variable  Independent variable  Confounder 1             Confounder 2       Confounder 3  Confounder 4  Spearman partial correlation coef  p-value
1    ISCHAEMIC.HEART..   WHEAT.FLOUR...        -                        -                  -             -             0.430                              0.0004
2    ISCHAEMIC.HEART..   WHEAT.FLOUR...        MONOUNSATURATED.FATTY..  -                  -             -             0.303                              0.0138
3    ISCHAEMIC.HEART..   WHEAT.FLOUR...        MONOUNSATURATED.FATTY..  GREEN.VEGETABLE..  -             -             0.274                              0.0287
4    ISCHAEMIC.HEART..   WHEAT.FLOUR...        MONOUNSATURATED.FATTY..  GREEN.VEGETABLE..  WINE.INTAKE   -             0.227                              0.0732
5    ISCHAEMIC.HEART..   WHEAT.FLOUR...        MONOUNSATURATED.FATTY..  GREEN.VEGETABLE..  WINE.INTAKE   RICE.INTAKE   0.046                              0.7197
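The way a correlation can shrink once a confounder is controlled for can be reproduced in miniature. The sketch below is plain Python on synthetic data, not the China Study data and not the R call used for this page. It ranks every series and then applies the standard recursive formula for partial correlation, removing one control variable at a time. The function names and the simulated variables are my own assumptions for illustration.

```python
import random


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def partial_corr(x, y, controls):
    """Correlation of x and y with the control series held constant.

    Uses the recursion r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)),
    peeling off the last control variable at each level.
    """
    if not controls:
        return pearson(x, y)
    *rest, z = controls
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


def partial_spearman(x, y, controls):
    """Spearman version: rank every series first, then take partial correlations."""
    return partial_corr(ranks(x), ranks(y), [ranks(c) for c in controls])


# Synthetic data: a hidden driver influences both series, which are otherwise unrelated.
rng = random.Random(1)
driver = [rng.gauss(0, 1) for _ in range(200)]
food = [d + 0.5 * rng.gauss(0, 1) for d in driver]
disease = [d + 0.5 * rng.gauss(0, 1) for d in driver]

raw = partial_spearman(food, disease, [])
adjusted = partial_spearman(food, disease, [driver])
print(f"raw correlation = {raw:.3f}, controlling for the driver = {adjusted:.3f}")
```

Passing a longer list of control series adds the confounders one after another, in the same spirit as runs 2 to 5 in the table.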


Multi-dimensional Regression Analysis

In the second part of the report pack (pg 4 of 6) I wanted to visualize the results of a multi-dimensional regression. From the univariate correlations reports we see strong relationships and significance between

NASOPHARYNX.AND.OTHER.PHARYNX.CANCER
And
  • PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN (-0.75, 5.21E-13)
  • PERCENTAGE.OF.CALORIC.INTAKE.FROM.FAT (0.39, 0.0014)
  • FISH.INTAKE (0.57 , 5.59E-7)
  • MEAT INTAKE (red meat and poultry) (0.45, 0.0001)

Using the following linear model formula in R-studio

lm(formula = Data$NASOPHARYNX.AND.OTHER.. ~ Data$PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN.. + Data$PERCENTAGE.OF.CALORIC.INTAKE.FROM.FAT.. + Data$FISH.INTAKE.. + Data$MEAT.INTAKE.., data = Data)


The above model results in a very significant p-value of 2.29E-6 for the F-statistic, implying that at least one of the predictor variables is significantly related to the outcome variable. The t-value of each predictor evaluates whether or not there is a significant association between that predictor and the outcome variable. The predictors with t-values closest to zero can be removed one by one and the model re-run.
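The elimination step described above can be sketched in plain Python. This is a simplified illustration on synthetic data, not the R-studio model used here. It fits an ordinary least squares model, computes the t-value of each coefficient, and picks out the predictor whose t-value is closest to zero as the candidate to drop. All names and data are assumptions of mine, and the p-value of the F-statistic is not computed.

```python
import random


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        scale = a[col][col]
        a[col] = [v / scale for v in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]


def ols_t_values(rows, y):
    """Fit y on the predictor rows (an intercept is added); return (coefficients, t_values)."""
    X = [[1.0] + list(r) for r in rows]
    n, k = len(X), len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    residuals = [y[i] - sum(X[i][a] * beta[a] for a in range(k)) for i in range(n)]
    sigma2 = sum(e * e for e in residuals) / (n - k)  # residual variance
    t_values = [beta[a] / (sigma2 * inv[a][a]) ** 0.5 for a in range(k)]
    return beta, t_values


def weakest_predictor(names, t_values):
    """Name of the predictor whose t-value is closest to zero (intercept skipped)."""
    return min(zip(names, t_values[1:]), key=lambda pair: abs(pair[1]))[0]


# Synthetic data: the outcome depends on x1 only; x2 is pure noise.
rng = random.Random(42)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
outcome = [2.0 + 3.0 * x1 + rng.gauss(0, 1) for x1, _ in rows]

coefficients, t_values = ols_t_values(rows, outcome)
print("coefficients:", [round(c, 3) for c in coefficients])
print("t values:", [round(t, 3) for t in t_values])
print("candidate to remove:", weakest_predictor(["x1", "x2"], t_values))
```

Re-running the fit without the weakest predictor, and repeating, gives the sequence of runs shown in the tables that follow.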


1st run
Variables                                          Coefficients  t value
Intercept                                          20.71577       2.791
PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN..  -2.06428      -3.233
PERCENTAGE.OF.CALORIC.INTAKE.FROM.FAT..             0.21831       1.022
FISH.INTAKE                                         0.04598       1.676
MEAT.INTAKE                                        -0.01685      -0.351

Removing meat intake, the predictor with the t-value closest to zero, improves the p-value of the F-statistic to 5.882e-07, with the new t-values as follows:
2nd run
Variables                                          Coefficients  t value
Intercept                                          20.29951       2.791
PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN..  -1.99310      -3.316
PERCENTAGE.OF.CALORIC.INTAKE.FROM.FAT..             0.17801       0.995
FISH.INTAKE                                         0.04534       1.668

Removing fat improves it further (p-value: 1.83e-07):
3rd run
Variables                                          Coefficients  t value
Intercept                                          25.36          4.878
PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN..  -2.23040      -4.043
FISH.INTAKE                                         0.05507       2.171

Removing fish then weakens the model slightly (p-value: 2.375e-07):
4th run
Variables                                          Coefficients  t value
Intercept                                          31.8660        7.290
PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN..  -2.8342       -5.783

Thus, of the 4 initially selected variables, the best linear model need only include plant protein and fish for the ‘strongest’ fit.

The linear model of

NC = 25.36 -2.230 * PERCENTAGE.OF.CALORIC.INTAKE.FROM.PLANT.PROTEIN + 0.055 * FISH


is shown for 3 typical fish intakes on pg 4.
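As a small worked example, the fitted equation can be evaluated directly. The coefficients below are the rounded ones from the equation above; the plant-protein percentage and the three fish intakes are values I have invented purely for illustration and are not the ones plotted in the report.

```python
# Rounded coefficients taken from the fitted model quoted above.
INTERCEPT = 25.36
PLANT_PROTEIN_COEF = -2.230
FISH_COEF = 0.055


def predicted_nc(plant_protein_pct, fish_intake):
    """Predicted nasopharynx cancer mortality rate from the two-predictor model."""
    return INTERCEPT + PLANT_PROTEIN_COEF * plant_protein_pct + FISH_COEF * fish_intake


# Hypothetical inputs: one plant-protein share and three fish intakes.
plant_protein_pct = 8.0
for fish_intake in (0.0, 20.0, 40.0):
    rate = predicted_nc(plant_protein_pct, fish_intake)
    print(f"fish intake {fish_intake:5.1f} -> predicted rate {rate:.2f}")
```

Because the model is linear, each fish intake gives a parallel straight line when the prediction is plotted against the plant-protein percentage.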

As an aside, this model may be misleading, as a quick Google search suggests that the real association may be between salted fish and nasopharyngeal cancer!

Finally, the geospatial capabilities of the Power BI standard map visual are shown on pg 5, as well as a choropleth using a Mapbox visual add-in on pg 6. On the choropleth map, use the mouse wheel to zoom, click and hold the right mouse button and move the mouse to swivel and tilt, and click and hold the left mouse button and move the mouse to pan the map.

Technologies utilised in this page

The starting point was the downloaded CSV files. Using Power BI Query Editor, the tables were linked on a composite field 'C_S_X_D' (County, Sex, Xiang, Date, i.e. '83 or '89). Separate correlation-coefficient tables were constructed using a Python script that compares the X and Y values of fields in the raw data using the function scipy.stats.spearmanr. For a nice explanation of partial correlations in R, see here.
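The linking itself was done in Power BI Query Editor, but the idea of the composite key can be sketched in plain Python. The column names, the two measurement columns and all the values below are invented for illustration; the real files have many more columns.

```python
import csv
import io

KEY_FIELDS = ("County", "Sex", "Xiang", "Date")  # assumed column names


def composite_key(row):
    """Build the 'C_S_X_D' style key from the four identifying columns."""
    return "_".join(row[field] for field in KEY_FIELDS)


def join_on_key(left_rows, right_rows):
    """Inner join of two lists of dict rows on the composite key."""
    right_by_key = {composite_key(r): r for r in right_rows}
    joined = []
    for left in left_rows:
        match = right_by_key.get(composite_key(left))
        if match is not None:
            merged = {"C_S_X_D": composite_key(left), **match, **left}
            joined.append(merged)
    return joined


# Invented stand-ins for a mortality file and a diet file.
mortality_csv = (
    "County,Sex,Xiang,Date,heart_disease_rate\n"
    "AA,T,3,89,12.5\n"
    "AB,T,3,89,30.1\n"
)
diet_csv = (
    "County,Sex,Xiang,Date,wheat_flour\n"
    "AA,T,3,89,150.0\n"
    "AC,T,3,89,20.0\n"
)

mortality_rows = list(csv.DictReader(io.StringIO(mortality_csv)))
diet_rows = list(csv.DictReader(io.StringIO(diet_csv)))

linked = join_on_key(mortality_rows, diet_rows)
for row in linked:
    print(row["C_S_X_D"], row["heart_disease_rate"], row["wheat_flour"])
```

Only rows whose key appears in both tables survive the join, which is why a smaller table (such as the questionnaire data) reduces the number of usable observations.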

The correlation charts were developed using a 'bubble matrix'. This consists of a matrix table with the two variables assigned to a row and a column. The values are SVG objects (circles with a radius based on the correlation coefficient). The basic technique is detailed here.
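To illustrate the encoding, here is a rough sketch of how one cell of such a bubble matrix could be generated as an SVG string. The function name, the sizes and the exact markup are my own assumptions; the linked technique may build its SVG differently, and in Power BI the string would still need to be exposed to the matrix visual as an image.

```python
def svg_bubble(rho, p_value, max_radius=20):
    """Return an SVG circle encoding one correlation.

    radius  -> strength of the correlation, |rho|
    colour  -> red for a positive correlation, green otherwise
    opacity -> 1 minus the p-value
    """
    radius = round(abs(rho) * max_radius, 2)
    colour = "red" if rho > 0 else "green"
    opacity = round(1 - p_value, 4)
    size = 2 * max_radius
    return (
        f"<svg xmlns='http://www.w3.org/2000/svg' width='{size}' height='{size}'>"
        f"<circle cx='{max_radius}' cy='{max_radius}' r='{radius}' "
        f"fill='{colour}' fill-opacity='{opacity}'/></svg>"
    )


# The wheat flour vs heart disease correlation quoted earlier on this page.
print(svg_bubble(0.43, 0.0004))
# A strong negative correlation for comparison.
print(svg_bubble(-0.75, 5.21e-13))
```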

The multiple linear regressions (MLR) were done directly in R-studio after uploading the datasets used in calculating the partial correlations. A useful explanation of the technique can be found here. A helpful article on interpreting MLRs and fine-tuning a predictive model is given here.

I received some helpful advice on creating choropleths in Power BI via Mapbox from Selina, and I am also most grateful to GeoData at UC Berkeley Library for the shape file of China's provinces.

Disclaimer

While I have taken great care to present this analysis as accurately as possible, it has not been reviewed by other third parties and as such may contain errors. I receive no financial compensation from any parties associated with the subject matter.