{"id":117,"date":"2014-04-12T04:02:43","date_gmt":"2014-04-12T04:02:43","guid":{"rendered":"https:\/\/krushton.com\/blog\/?p=117"},"modified":"2023-05-11T23:44:43","modified_gmt":"2023-05-11T23:44:43","slug":"a-statistical-analysis-of-rprogresspics","status":"publish","type":"post","link":"https:\/\/krushton.com\/blog\/a-statistical-analysis-of-rprogresspics\/","title":{"rendered":"Visualizing Reddit Data in IPython Notebook"},"content":{"rendered":"<div class=\"entry-content\">\n<p>In my last\u00a0semester\u00a0at the I School\u00a0I took an introductory course in\u00a0data analysis using Python. I was pretty\u00a0unfamiliar with statistics prior to the course and am still very much an amateur data scientist, but the course gave me just enough skills to be\u00a0brave (perhaps foolishly so) in the face of an unruly data set.<\/p>\n<p>Since the end of the semester I\u2019ve found myself going back to these tools frequently, because they enable me to actually follow up on the random questions\/curiosities\/whims I get from time to time using public datasets and a few lines of code.<\/p>\n<p><strong>What is r\/ProgressPics?<\/strong><\/p>\n<p>For this post I pulled data from a subreddit called <a href=\"https:\/\/web.archive.org\/web\/20150506105427\/http:\/\/reddit.com\/r\/progresspics\">Progress Pics<\/a>. Progress pics is a place where people who are working on some form of body transformation go to post before and after pictures of themselves. 
I\u2019ve been trying to improve my fitness lately, so I\u2019ve been frequenting the sub on and off for about six months.<\/p>\n<p>It caught my eye as a potential source of interesting\u00a0analysis because, unlike most subreddits, the community at progress pics has a post title format that encourages the use of\u00a0structured data.\u00a0Posters who are providing pictures are instructed to include their gender, age, height, start weight, and end weight in the post title, in a format that looks something like:\u00a0<strong>F\/23\/5\u20195\u2033 [189lbs &gt; 169lbs = 20lbs]<\/strong>.<\/p>\n<p>The more structure you add to a blob of text, the easier it is to understand programmatically,\u00a0so this data seemed like a good opportunity for analysis.<\/p>\n<p><strong>Method<\/strong><\/p>\n<p>This analysis was done in\u00a0IPython Notebook, using\u00a0the data analysis packages pandas, NumPy, and matplotlib. I used the <a href=\"https:\/\/web.archive.org\/web\/20150506105427\/http:\/\/www.reddit.com\/dev\/api\">Reddit\u00a0API<\/a> to pull in as many posts from the subreddit as I could before the API complained, and ended up with a dataset of about 1600 posts. The Reddit API provided post metadata including title, number of comments, number of votes, date\/time, and more. Then, after some initial data cleanup, I used a series of regular expressions to extract the poster\u2019s gender, age, height, start weight, and end weight from the post title.<\/p>\n<p>After being filtered through these regexes, about half (863) of the posts had usable values for all of these fields. I dropped the rest, since 863 posts is still a sufficiently large sample.<\/p>\n<p><strong>About the Data Set<\/strong><\/p>\n<p>So, who are the posters of r\/ProgressPics? 
Some quick facts:<\/p>\n<ul>\n<li><strong>Gender<\/strong>: 362 female (42%), 501\u00a0male (58%)<\/li>\n<li><strong>Age<\/strong>: Average\u00a0age 24, range 15-54<\/li>\n<li><strong>Height<\/strong>: Average\u00a0height 5\u20196\u2033 , range 4\u201911\u2033 \u2013 6\u20198\u2033<\/li>\n<li><strong>Pounds change (lost or gained)<\/strong>: Average\u00a047 pounds, range 0-215<\/li>\n<\/ul>\n<p><strong>Descriptive Statistics<\/strong><\/p>\n<p>Here are some random findings from the data provided by Reddit and scraped from the title:<\/p>\n<\/div>\n<div class=\"entry-content\">\n<ol>\n<li><strong>Almost no one has Reddit Gold<\/strong><br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-171\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/gold.png\" alt=\"gold\" width=\"359\" height=\"271\" \/><\/li>\n<li><strong>About 7% of posts are NSFW<\/strong>. I don\u2019t have a chart of it, but more women than men post NSFW posts.<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-175\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/nsfw.png\" alt=\"nsfw\" width=\"359\" height=\"271\" \/><\/li>\n<li><strong>The vast majority of pictures are posted on Imgur:<\/strong><br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-173\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/imgur.png\" alt=\"imgur\" width=\"359\" height=\"271\" \/><\/li>\n<li>The<strong>\u00a0age demographic<\/strong> is pretty representative of Reddit as a whole:<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-169\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/age_histogram.png\" alt=\"age_histogram\" 
width=\"426\" height=\"300\" \/><\/li>\n<li>Here we see the <strong>height of posters<\/strong> broken out by <strong>gender<\/strong>. The huge jump is at 72\u2033 or\u00a06\u2032, which probably indicates some fibbing on the part of\u00a0the 5\u201911\u2033 males.<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-172\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/height.png\" alt=\"height\" width=\"426\" height=\"300\" \/><\/li>\n<li>This chart shows a histogram of start and end weights, which really helps visualize the weight lost!<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-168\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/weight_histogram.png\" alt=\"weight_histogram\" width=\"426\" height=\"300\" \/><\/li>\n<li>For this analysis I was very interested in the influence of gender on voting and commenting behavior. It seemed to me that female posters get\u00a0far more <strong>votes and comments<\/strong> than male posters, and the data bears this out:<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-165\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/score_comments_gender.png\" alt=\"score_comments_gender\" width=\"426\" height=\"300\" \/><\/li>\n<li>A scatterplot demonstrates the relationship between <strong>gender, scores, and\u00a0pounds lost\/gained<\/strong>. 
While\u00a0male posters\u00a0hover around the lower range of scores regardless of pounds lost, and some women fare about the same, a select few women climb out of the fray with\u00a01500+ points. This chart also shows that almost no women report\u00a0weight gain.<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-166\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/score_scatterplot.png\" alt=\"score_scatterplot\" width=\"426\" height=\"300\" \/><\/li>\n<\/ol>\n<p><strong>Significant Relationships<\/strong><\/p>\n<p>What conclusions can we draw from a correlation analysis of this data? With a sample size of more than 800, the correlation coefficient doesn\u2019t have to be extremely large to be significant.<\/p>\n<p>Assorted\u00a0findings from looking at\u00a0correlations:<\/p>\n<ul>\n<li>Unsurprisingly, there is a strong positive correlation\u00a0between pounds change and score \u2013 in general, people who lose more weight receive more upvotes and more comments.<\/li>\n<li>There is also a positive relationship between age and pounds change, possibly because older people have put on more weight over time and thus have more to lose.<\/li>\n<li>For men, age is positively correlated with final BMI. Older men tend to be bigger than younger men. There is no such relationship for women.<\/li>\n<li>There is a weak correlation between age and number of upvotes for men. Older men tend to receive more votes than younger men.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><b>Gender, Final BMI, and Score<br \/>\n<\/b><\/p>\n<p>One particular area worth exploring is the relationship between gender, final BMI, number of comments, and score.<\/p>\n<p>This was ultimately the question that piqued my interest in this analysis. Anecdotally speaking, it seemed that a certain subset of posters\u00a0were receiving\u00a0an inordinate number of votes\u00a0and comments when compared with the number of pounds lost. 
In other words, it seemed like people were voting based on the current size of the person posting\u00a0rather than the size of the accomplishment. Furthermore, this effect seemed\u00a0to be particularly strong when the poster was female.<\/p>\n<p>The data supports these conclusions:<\/p>\n<ul>\n<li>While there is a weak inverse correlation between final BMI and score for all posters, this relationship is strong for female posters. In other words, as a female poster\u2019s BMI goes down, the score the post receives goes up.<\/li>\n<li>When looking at voting behavior by final BMI, there is an interesting pattern: downvotes are highest at the lower end of the bell curve. It turns out that the number of downvotes is also inversely correlated with BMI.<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-167\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/upvotes_downvotes.png\" alt=\"upvotes_downvotes\" width=\"426\" height=\"300\" \/><\/li>\n<li>When we group into standard BMI categories, the effects of gender and body size on score are even more striking. For men, posters who are considered normal or overweight receive approximately the same average score. It\u2019s also \u201cbetter\u201d on r\/progresspics for a man to be considered obese than underweight. For women, the situation is reversed.<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-174\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/mean_score_gender.png\" alt=\"mean_score_gender\" width=\"425\" height=\"360\" \/><\/li>\n<li>A final point of interest is the number of comments a post receives. 
While the number of comments is related to the score, people also tend to comment on things that they don\u2019t like (\u201ccontroversial\u201d posts in Reddit land).\u00a0As the following chart shows, posts by underweight women draw a flurry of comments.<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-170\" src=\"https:\/\/web.archive.org\/web\/20150506105427im_\/https:\/\/krushton.com\/blog\/wp-content\/uploads\/\/2014\/06\/comments_per_post.png\" alt=\"comments_per_post\" width=\"425\" height=\"360\" \/><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><strong>Data<\/strong><\/p>\n<p>If you\u2019d like to perform your own analysis of this data, <a href=\"https:\/\/krushton.com\/progresspics.csv\">click here to download it as a CSV<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>In my last\u00a0semester\u00a0at the I School\u00a0I took an introductory course in\u00a0data analysis using Python. I was pretty\u00a0unfamiliar with statistics prior to the course and am still very much an amateur data scientist, but the course gave me just enough skills to be\u00a0brave (perhaps foolishly so) in the face of an unruly data set. 
Since the &hellip;<\/p>\n","protected":false},"author":1,"featured_media":511,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_vp_format_video_url":"","_vp_image_focal_point":[],"footnotes":""},"categories":[1],"tags":[],"class_list":["post-117","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"wps_subtitle":"Data exploration","_links":{"self":[{"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/posts\/117","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/comments?post=117"}],"version-history":[{"count":10,"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/posts\/117\/revisions"}],"predecessor-version":[{"id":513,"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/posts\/117\/revisions\/513"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/media\/511"}],"wp:attachment":[{"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/media?parent=117"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/categories?post=117"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/krushton.com\/blog\/wp-json\/wp\/v2\/tags?post=117"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}