Tuesday, October 15, 2019

Why grouping fruit and vegies together in an interventional study is probably a bad idea.

https://cdn.steemitimages.com/DQmW7DZ4yip4NDNknYyocc279GKfAxNet1eAgbQu9o4EJyM/image.png

In this blog post I want to look at nutrition groups. Specifically, I want to look, in an objective way, at the nutrition profile of fruit compared to other food groups. In nutrition studies, fruit is often grouped with vegetables, but is this actually a fair grouping? I want to use a public nutrition database and some basic Python pandas functionality to check whether it is justified.

# Getting the data

We start off by getting some nutrition info from https://fineli.fi/fineli/fi/avoin-data

The unpacked zip file contains a number of CSV files that we will load into pandas.

```python
# Jupyter notebook magic; drop this line when running as a plain script.
%matplotlib inline
import math
import numpy as np
import pandas
import matplotlib.pyplot as plt

# The Fineli files are semicolon-separated; nutrient values use a decimal comma.
component_value = pandas.read_csv("component_value.csv", sep=';', decimal=',')
food = pandas.read_csv("food.csv", sep=';', encoding='latin1')
foodname = pandas.read_csv("foodname_EN.csv", sep=';', encoding='latin1')
fuclass = pandas.read_csv("fuclass_EN.csv", sep=';')
# Keep only rows whose nutrient code (EUFDNAME) is a string, dropping missing codes.
component_value = component_value[component_value['EUFDNAME'].apply(lambda x: isinstance(x, str))]
eufdname = pandas.read_csv("eufdname_EN.csv", sep=';')
```

# Normalizing the data

The next step is to normalize the nutrient data, so that from here on we can work with distances between normalized vectors. For each nutrient in the database we take the mean and standard deviation over all foods and use them to convert the raw amounts to z-values. We then build a new data frame with foods as rows and normalized nutrients as columns.
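As a small illustration of the z-value idea on its own, here is a minimal sketch on made-up numbers. The foods, the nutrient names and the amounts are all hypothetical and are not taken from the Fineli data; the point is only to show what subtracting the mean and dividing by the standard deviation does to a column.

```python
import pandas as pd

# Hypothetical toy data: amounts per 100 g for three made-up foods.
toy = pd.DataFrame(
    {"sugar": [10.0, 2.0, 0.0], "vitamin_c": [50.0, 90.0, 10.0]},
    index=["food_a", "food_b", "food_c"],
)

# z-value per column: subtract the column mean, then divide by the
# column's sample standard deviation (pandas uses ddof=1 by default).
z = (toy - toy.mean()) / toy.std()
print(z.round(3))
```

After this step every column has mean 0 and standard deviation 1, so a nutrient measured in grams and one measured in micrograms contribute on the same scale.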
```python
# Attach the food-class description (DESCRIPT) and the food name to each food id.
df = pandas.merge(left=food[["FOODID", "FUCLASS"]], right=fuclass[["THSCODE", "DESCRIPT"]],
                  how='left', left_on="FUCLASS", right_on="THSCODE")[["FOODID", "DESCRIPT"]]
foodshort = foodname[["FOODID", "FOODNAME"]]
df = pandas.merge(how='left', right=df, left=foodshort, left_on="FOODID", right_on="FOODID")

# Add one z-value column per nutrient.
for comp in component_value["EUFDNAME"].unique():
    # .copy() so that adding a column does not write into a slice of component_value.
    filtered = component_value[component_value["EUFDNAME"] == comp][["FOODID", "BESTLOC"]].copy()
    std = filtered["BESTLOC"].std()
    mean = filtered["BESTLOC"].mean()
    filtered[comp] = (filtered["BESTLOC"] - mean) / std
    filtered = filtered[["FOODID", comp]]
    df = pandas.merge(left=df, right=filtered, how='left', left_on='FOODID', right_on='FOODID')

# A food with no value for a nutrient gets 0, i.e. the database mean for that nutrient.
df = df.fillna(0)
```

# Food groups

Now that we have our normalized data, let's have a look at fruit as a group and see how that group compares to the other groups in our data set. The distance between the mean vegetable and the mean fruit is used as the reference, so every other distance is expressed as a multiple of it.

```python
fruit = df.loc[df['DESCRIPT'] == 'Fruits']
vegies = df.loc[df['DESCRIPT'] == 'Vegetables']
# numeric_only=True skips the text columns (required on recent pandas versions);
# [1:] then drops FOODID so that only the nutrient columns remain.
fruit_mean = fruit.mean(numeric_only=True)
reference_vectordistance = np.linalg.norm((vegies.mean(numeric_only=True) - fruit_mean).values[1:])
rowlist = []
for foodtype in df["DESCRIPT"].unique():
    if foodtype != 'Fruits':
        other = df.loc[df['DESCRIPT'] == foodtype]
        vectordistance = np.linalg.norm((other.mean(numeric_only=True) - fruit_mean).values[1:])
        # The 10.0001 cut-off is above every value in this data set, so nothing is filtered out.
        if vectordistance / reference_vectordistance < 10.0001:
            rowlist.append({"foodtype": foodtype,
                            "reldistance": vectordistance / reference_vectordistance})
peergroups = pandas.DataFrame(rowlist)
with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
    print(peergroups.sort_values(by=['reldistance']))
```

         foodtype  reldistance
    106  Baby fruit and berry product  0.254584
    24  Juices  0.324371
    23  Fruit and berry salads  0.419500
    88  Vegetable salads  0.433896
    27  Juice drink  0.499597
    89  Fruit and berry soups  0.512050
    82  Fruit and berry dishes other than pies  0.572982
    67  Other drinks  0.619280
    84  Vegetable soups  0.652777
    107  Baby vegetable product  0.655973
    65  Soft drink with sugar  0.710715
    20  Vegetable juices  0.716880
    95  Pulse soups  0.729302
    83  Potato dishes  0.742103
    86  Cooked vegetables  0.744533
    90  Vegetable sauces  0.753597
    87  Vegetable dishes  0.764830
    109  Baby fish dish  0.774588
    68  Drinking water  0.778795
    97  Meat soups  0.780833
    42  Yoghurt  0.805619
    93  Pulse sauces  0.806330
    38  Milks skimmed  0.809094
    108  Baby meat dish  0.825446
    70  Porridge  0.829892
    116  Sport drink  0.831736
    100  Poultry soups  0.842000
    41  Cultured milks  0.845855
    59  Coffee  0.846159
    60  Tea  0.850300
    15  Cooked potatoes  0.889403
    85  Mixed salads  0.897260
    94  Pulse dishes  0.897811
    37  Milks >2% fat  0.907533
    39  Soured milks  0.912002
    81  Milk desserts  0.920313
    44  Milks <2% fat  0.928574
    69  Drinks with artificial sweeteners  0.935823
    112  Seafood soup  0.936404
    79  Milk sauces  0.999358
    17  Vegetables  1.000000
    43  Quark  1.000320
    120  Dietary supplement  1.013490
    5  Savoury sauces  1.037593
    25  Berries  1.054466
    103  Fish soups  1.064546
    16  Fried potatoes, French fries  1.106613
    91  Prepared salads with mayonnaise  1.109879
    78  Panncakes  1.132018
    101  Poultry dishes  1.169911
    19  Mushroom dishes  1.174364
    49  Ice cream  1.194230
    111  Seafood dishes, crustacean and molluscs  1.225429
    96  Meat sauces  1.226042
    92  Meat dishes  1.229997
    105  Dessert sauces  1.237735
    40  Fermented milk products, other  1.259492
    102  Poultry sauces  1.285769
    12  Rice  1.313129
    45  Cream  1.329619
    110  Seafood sauces  1.334021
    118  Pizza  1.337806
    73  Savoury bakery  1.343459
    10  Pasta  1.345270
    4  Condiments  1.378994
    75  Sandwiches and burgers  1.403498
    74  Sweet bakery  1.447861
    26  Jams and marmalades  1.461037
    66  Ciders  1.513203
    104  Fish sauces  1.523617
    77  Buns  1.609077
    80  Egg dishes  1.611915
    98  Fish dishes  1.682396
    76  Wheat bread  1.689758
    99  Sausages  1.708877
    72  Bread, mixed flour  1.730758
    21  Pulses  1.739442
    46  Cheese, unripened, fresh cheese  1.739644
    18  Canned vegetables  1.781269
    117  Cold cuts, sausages  1.807501
    57  Crustaceans and molluscs  1.904892
    53  Cold cuts, meat  1.998762
    71  Rye bread  2.126735
    114  Savoury biscuits  2.146351
    11  Sweet biscuits  2.227579
    3  Miscellaneous ingredients  2.309356
    52  Chicken and other birds  2.392425
    64  Other alcoholic beverages  2.423097
    7  Cereal bars  2.439470
    48  Processed cheese  2.451552
    50  Steaks and chops  2.459782
    13  Breakfast cereals  2.524057
    8  Flour  2.544412
    22  Pulse products  2.647546
    55  Fish  2.676576
    47  Cheese, ripened cheese  2.809589
    14  Savoury snacks  2.811795
    61  Beers  2.856793
    115  Meal replacements  2.947267
    2  Chocolate  3.004876
    62  Wines  3.229641
    1  Confectionery  3.240278
    34  Blended spread < 55 %  3.340992
    113  Infant formulas and human milk  3.342981
    36  Margarine and fat spread < 55%  3.349092
    9  Nuts, seeds and dried fruits  3.505547
    119  Sport food  3.511651
    51  Meat products  3.523952
    58  Egg  3.555769
    56  Fish products  3.588763
    33  Salad dressings and mayonnaises  3.628788
    6  Spices  4.116614
    29  Blended spread >= 55 %  4.244458
    54  Offal dishes  4.397559
    0  Sugar and syrups  4.436681
    35  Margarine and fat spread >= 55%  4.517717
    28  Butter, milk fat  5.055117
    30  Cooking and industrial fat  5.283595
    32  Animal fat  5.471986
    31  Oils  7.149777
    63  Spirits  8.044268

Notice how sugar-sweetened soft drinks (SSBs) as a group are 29% closer, nutritionally, to fruit as a group than vegetables as a group are, at least according to our simple metric. Even drinking water and yoghurt are closer. This doesn't give us much to justify grouping fruits with vegetables in nutrition studies. And this is just the distance between the *means* of these food groups.

# A specific fruit

We looked at this for groups; now let's look at a specific fruit, one of my own favorites: a melon. And let's not look at SSBs, drinking water and yoghurt this time, but at foods that are generally thought of as unhealthy and that few people would think of comparing to a healthy piece of fruit. We take a look at McDonald's food and at chocolates and see how they compare to a melon.
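The relative distance used in this post is the Euclidean distance between two z-value vectors, divided by the vegetables-to-fruits reference distance. A minimal sketch on made-up vectors may make that concrete. The three vectors and the helper name `relative_distance` are hypothetical and only serve as an illustration; they are not part of the analysis.

```python
import numpy as np

# Hypothetical z-value vectors over three made-up nutrients.
fruit_mean = np.array([1.0, 0.0, 0.0])
vegetable_mean = np.array([0.0, 1.0, 2.0])
juice_mean = np.array([1.0, 0.5, 0.0])

# Reference: how far the mean vegetable is from the mean fruit.
reference = np.linalg.norm(vegetable_mean - fruit_mean)

def relative_distance(a, b, reference):
    """Euclidean distance between a and b, as a multiple of the reference distance."""
    return np.linalg.norm(a - b) / reference

rel = relative_distance(juice_mean, fruit_mean, reference)
print(round(rel, 3))
```

A value below 1 means the two items are closer together than the vegetable and fruit group means are; a value above 1 means they are further apart.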
```python
# Shift every nutrient column so that the honeydew melon sits at the origin.
melon = df.loc[df['FOODNAME'] == 'HONEYDEW MELON, WITHOUT SKIN']
for header in df.columns:
    if header not in ["FOODID", "FOODNAME", "DESCRIPT"]:
        df[header] = df[header] - melon[header].values[0]

rowlist = []
for index, row in df.iterrows():
    name = row.values[1]
    foodtype = row.values[2]
    # Nutrient columns only, as floats; after the shift this is the offset from the melon.
    vector = row.values[3:].astype(float)
    distance = np.linalg.norm(vector) / reference_vectordistance
    if "MCDONALD" in name or foodtype == "Chocolate":
        rowlist.append({"distance": distance, "food": name})
peerfood = pandas.DataFrame(rowlist)
with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
    print(peerfood.sort_values(by=['distance']))
```

        distance  food
    21  0.956963  MILKSHAKE, VANILLA, MCDONALD'S
    23  1.515775  HAMBURGER, MCFEAST, MCDONALD'S
    10  1.589669  HAMBURGER, BEEF AND WHEAT ROLL, MCDONALD'S
    11  1.647408  HAMBURGER, CHEESE BURGER, MCDONALD'S
    13  1.647526  HAMBURGER, DOUBLE BURGER, BIG MAC, MCDONALD'S
    12  1.728808  HAMBURGER, CHICKEN BURGER, MCDONALD'S
    22  1.986608  HAMBURGER, DOUBLE CHEESE BURGER, MCDONALD'S
    1  2.542666  CHOCOLATE CONFECTION FILLED WITH MARMALADE
    14  2.915553  CHOCOLATE BAR, CARAMEL AND COOKIE, TWIX
    24  2.915631  CHOCOLATE CONFECTION FILLED WITH CHOCOLATE
    18  2.978043  SUFFELI CHOCOLATE BAR, WAFFLE, TOFFEE FILLING ...
    7  3.174270  CHOCOLATE BAR, LOW-FAT
    6  3.224222  CHOCOLATE BAR WITH FILLING, AVERAGE
    2  3.336511  CHOCOLATE BAR, AVERAGE
    15  3.350807  SUFFELI PUFFI SNACKS,PUFFED CORN AND CHOCOLATE...
    3  3.354410  CHOCOLATE, PLAIN, DARK CHOCOLATE
    8  3.409912  CHOCOLATE, WHITE CHOCOLATE
    0  3.431798  CHOCOLATE, AVERAGE
    16  3.469495  CHOCOLATE NUT SPREAD
    20  3.509146  CHOCOLATE, MILK CHOCOLATE WITH HAZELNUTS
    4  3.763362  CHOCOLATE, MILK CHOCOLATE
    17  3.789385  KINDER CHOCOLATE EGG
    9  3.863914  RICE CHOCOLATE
    19  4.175258  CHOCOLATE, PLAIN, DARK CHOCOLATE, 80%
    5  5.557062  CHOCOLATE, ARTIFICIALLY SWEETENED

Notice that the milkshake's distance to the melon (0.96) is smaller than the reference distance of 1.0 between the mean vegetable and the mean fruit. Now let us pick a few nice ones from this list, the milkshake, the double cheeseburger and the Twix candy bar, and see how individual vegetables compare to these:

```python
count1 = 0
count2 = 0
count3 = 0
tcount = 0
for index, row in df.iterrows():
    name = row.values[1]
    foodtype = row.values[2]
    vector = row.values[3:].astype(float)
    distance = np.linalg.norm(vector) / reference_vectordistance
    if "Vegetables" == foodtype:
        tcount += 1
        # Thresholds are the melon distances of the Twix, the double
        # cheeseburger and the milkshake from the table above.
        if distance > 2.915553:
            count3 += 1
        if distance > 1.986608:
            count2 += 1
        if distance > 0.956963:
            count1 += 1
print("* A milkshake is nutritionally closer to a melon than", count1, "out of", tcount, "vegetables.")
print("* A double cheeseburger is nutritionally closer to a melon than", count2, "out of", tcount, "vegetables.")
print("* A Twix candy bar is nutritionally closer to a melon than", count3, "out of", tcount, "vegetables.")
```

* A milkshake is nutritionally closer to a melon than 71 out of 103 vegetables.
* A double cheeseburger is nutritionally closer to a melon than 26 out of 103 vegetables.
* A Twix candy bar is nutritionally closer to a melon than 13 out of 103 vegetables.

Does it still make sense to you to run interventional studies that put vegetables and fruits in the same group? I would argue it doesn't. But then, maybe you don't trust the normalized nutrition vector. Let's have a quick look at what the normalized nutrition values actually look like for broccoli, kale, a Twix and a McDonald's milkshake. Because the melon's values were subtracted from every row in the step above, each number below is the difference from the honeydew melon, in standard deviations.
```python
compare = ["MILKSHAKE, VANILLA, MCDONALD'S", 'KALE', 'BROCCOLI',
           'CHOCOLATE BAR, CARAMEL AND COOKIE, TWIX']
part = df.loc[df['FOODNAME'].isin(compare)]
part = part.set_index('FOODNAME').drop(['FOODID', 'DESCRIPT'], axis=1)
# Foods become columns, nutrient codes become the index.
part = part.transpose().rename(columns={"CHOCOLATE BAR, CARAMEL AND COOKIE, TWIX": "TWIX",
                                        "MILKSHAKE, VANILLA, MCDONALD'S": "MILKSHAKE"})
# Map nutrient codes (THSCODE) to readable descriptions (DESCRIPT).
names = eufdname.drop(['LANG'], axis=1).rename(columns={"THSCODE": "FOODNAME"}).set_index("FOODNAME")
pandas.merge(how='left', right=names, left=part, left_index=True, right_index=True).set_index("DESCRIPT")
```
DESCRIPT BROCCOLI KALE TWIX MILKSHAKE
energy,calculated -0.014435 0.008692 2.939874 0.280767
fat, total 0.026593 0.046368 1.644287 0.140263
carbohydrate, available -0.296969 -0.291666 2.667469 0.195103
protein, total 0.403420 0.255376 0.251675 0.276349
alcohol 0.000000 0.000000 0.000000 0.000000
organic acids, total 0.100330 -0.328305 -0.446863 -0.122195
sugar alcohols 0.000000 0.000000 0.000000 0.000000
sugars, total -0.572030 -0.562058 4.221419 0.366879
fructose -0.299162 -0.260447 0.716580 -0.457189
galactose -0.131067 -0.131067 -0.156703 2.268470
glucose -0.277072 -0.277072 1.086415 0.312071
lactose 0.000000 0.000000 0.495146 0.566464
maltose 0.056397 0.056397 0.479377 0.028199
sucrose -0.539437 -0.539437 4.571218 0.135884
starch, total 0.008880 0.008880 0.518603 0.000000
fibre, total 0.590525 1.290406 0.171471 -0.174970
fibre, water-insoluble 0.490057 0.285866 0.314453 -0.163352
polysaccharides, non-cellulosic, water-soluble 0.572664 0.572664 0.155437 -0.163618
folate, total 1.546457 1.643241 0.034257 0.031612
niacin equivalents, total 0.319264 0.188485 0.116861 0.136289
niacin, preformed (nicotinic acid + nicotinamide) 0.241677 0.281957 -0.039877 -0.079753
vitamers pyridoxine (hydrochloride) 0.224716 1.303355 -0.134830 -0.044943
riboflavine 0.657697 1.176932 0.169617 0.636928
thiamin (vitamin B1) 0.369891 0.475574 -0.065524 0.030120
vitamin A retinol activity equivalents 0.080351 0.734835 0.025000 0.015968
carotenoids, total 1.107892 16.470065 -0.005260 -0.018312
vitamin B-12 (cobalamin) 0.000000 0.000000 0.011532 0.087641
vitamin C (ascorbic acid) 2.999856 3.037288 -0.475378 -0.419766
vitamin D 0.000000 0.000000 0.004403 0.004403
vitamin E alphatocopherol 0.244547 1.993427 0.420917 0.007411
vitamin K, total 2.221502 12.518243 0.049457 0.007459
calcium 0.095603 1.051637 0.113176 0.487941
iron, total 0.207805 0.221893 0.202170 -0.016906
iodide (iodine) -0.018761 -0.018761 -0.017968 -0.016768
potassium 0.469394 0.678014 -0.061256 -0.120452
magnesium 0.197976 0.395951 0.327828 0.044940
salt -0.012421 -0.018237 0.108501 0.012049
phosphorus 0.361836 0.200401 0.228235 0.406369
selenium, total 0.177199 0.177199 0.027749 0.046444
zinc 0.390879 0.188201 0.319218 0.290988
fatty acids, total 0.013528 0.036739 1.618932 0.141260
fatty acids, total polyunsaturated 0.036390 0.107449 0.432253 -0.001229
fatty acids, total monounsaturated cis 0.002320 0.003099 1.345681 0.080810
fatty acids, total saturated 0.004420 0.009818 2.052560 0.233120
fatty acids, total trans 0.000000 0.000000 0.241665 0.147224
fatty acids, total n-3 polyunsaturated 0.073295 0.236547 0.023455 -0.017662
fatty acids, total n-6 polyunsaturated 0.009726 0.033363 0.535066 0.002518
fatty acid 18:2 cis,cis n-6 (linoleic acid) 0.010242 0.035108 0.560131 0.001131
fatty acid 18:3 n-3 (alpha-linolenic acid) 0.077831 0.251260 0.024947 -0.018711
fatty acid 20:5 n-3 (EPA) 0.000000 0.000000 0.000000 0.000000
fatty acid 22:6 n-3 (DHA) 0.000000 0.000000 0.000000 0.000000
cholesterol (GC) 0.000000 0.000000 0.107896 0.071424
sterols, total 0.197817 0.039677 0.176732 -0.010203
tryptophan 0.256079 0.843467 0.256079 0.342292
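One way to read a table like this is to ask which nutrients contribute most to a food's distance. Since the code above subtracts the melon's values before this table is built, each entry appears to be an offset from the melon, and the squared Euclidean distance is then simply the sum of the squared entries. The sketch below ranks a handful of the KALE values copied from the table; it is only an illustration on a subset of rows, so the shares are relative to that subset and not to the full vector.

```python
import pandas as pd

# KALE column, five rows copied from the table above (subset only).
kale_diff = pd.Series({
    "carotenoids, total": 16.470065,
    "vitamin K, total": 12.518243,
    "vitamin C (ascorbic acid)": 3.037288,
    "vitamin E alphatocopherol": 1.993427,
    "folate, total": 1.643241,
})

# Each nutrient's share of the squared distance within this subset.
squared = kale_diff ** 2
share = (squared / squared.sum()).sort_values(ascending=False)
print(share.round(3))
```

Within this subset, carotenoids and vitamin K account for most of the squared distance, which shows how a few extreme z-values can drive the overall metric.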
I hope the simple analysis above shows and justifies my stance that grouping fruits and vegetables together in an interventional study is a horrible idea. Whatever the outcome, it will say very little about either fruit or vegetables.

Note that in this analysis I didn't weight any of the nutrients, beyond the implicit weighting that comes from how the data set groups nutrients together or splits them into separate components. Also, the comparison is made per unit of weight. The results on a per-calorie basis differ in the details but lead to the same conclusion: other groups turn up as closer to fruit than vegetables are, but fruits and vegetables still turn out to be very different, and more different than many other obviously unrelated food groups in this data set. As I didn't want to make this blog post longer than it already is, I omitted the per-kcal variant.

As you might have noticed, I am more comfortable with data than I am with biochemistry, so there might be major issues with analyzing the distance between different foods in the way that I did above. I'm here to learn, so if there are fundamental flaws with this way of looking at the data, please drop me a comment, or let me know on [Twitter](https://twitter.com/EngineerDiet).
Originally posted here: https://steemit.com/steemstem/@pibara/why-grouping-fruit-and-vegies-together-in-an-interventional-study-is-probably-a-bad-idea
