{"id":1095,"date":"2013-09-12T19:37:16","date_gmt":"2013-09-12T19:37:16","guid":{"rendered":"http:\/\/www.philippeadjiman.com\/blog\/?p=1095"},"modified":"2025-07-18T13:47:03","modified_gmt":"2025-07-18T13:47:03","slug":"a-data-science-exploration-from-the-titanic-in-r","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2013\/09\/12\/a-data-science-exploration-from-the-titanic-in-r\/","title":{"rendered":"A Data Science Exploration From the Titanic in R"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/miro.medium.com\/v2\/resize%3Afit%3A1134\/0%2Aqkt5M5Lhyb0oRie4.png?ssl=1\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"http:\/\/www.kaggle.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Kaggle<\/a> offered this year a knowledge competition called &#8220;<a href=\"http:\/\/www.kaggle.com\/c\/titanic-gettingStarted\" target=\"_blank\" rel=\"noreferrer noopener\">Titanic: Machine Learning from Disaster<\/a>&#8221; exposing a popular&nbsp;&#8220;toy-yet-interesting&#8221; data set around the Titanic. 
The goal is to predict as accurately as possible the survival of the Titanic&#8217;s passengers based on their characteristics (age, sex, ticket fare, etc.).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this post, we&#8217;ll&nbsp;use that data set to:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\">Illustrate, through a comprehensive example, a set of useful tools\/packages for predictive modelling in the&nbsp;<a href=\"http:\/\/www.r-project.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">R<\/a>&nbsp;statistical framework.<\/li>\n\n\n\n<li class=\"\">Use the example to illustrate the process and the kind of tricks it takes to improve\/tune a predictive model.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><br>The full code that creates all the plots, stats and models shown in this post, and that builds a submission scoring 0.79426 on the leaderboard, can be found on GitHub <a href=\"https:\/\/github.com\/padjiman\/titanic-blog-post-code\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a> or on RPubs <a href=\"http:\/\/rpubs.com\/padjiman\/8419\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>&nbsp;(built with Knit HTML from <a href=\"http:\/\/www.rstudio.com\/ide\/\" target=\"_blank\" rel=\"noreferrer noopener\">RStudio<\/a>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Preliminaries<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, download the test and training sets from the <a href=\"http:\/\/www.kaggle.com\/c\/titanic-gettingStarted\/data\" target=\"_blank\" rel=\"noreferrer noopener\">data page<\/a> of the competition (here is a <a href=\"http:\/\/www.philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/titanic.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> of the two small files in case the Kaggle page is removed in the future).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Once you have loaded the data set into a data 
frame, you can do some data analysis\/exploration. Even though that part is critical to start playing with the data and getting a feel for it,&nbsp;I won&#8217;t go into details because blog posts have already been written about that; in particular, <a href=\"http:\/\/www.mattdelhey.com\/2013\/01\/10\/kaggle-excel-in-r\/\" target=\"_blank\" rel=\"noreferrer noopener\">that one<\/a> is a very nice R version of the <a href=\"http:\/\/www.kaggle.com\/c\/titanic-gettingStarted\/details\/getting-started-with-excel\" target=\"_blank\" rel=\"noreferrer noopener\">getting started with excel<\/a> data exploration tutorial on Kaggle&#8217;s website.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, I&#8217;ll just illustrate a nice, simple and effective way of observing one important aspect of the data: missing values.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <a href=\"http:\/\/cran.r-project.org\/web\/packages\/Amelia\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">Amelia<\/a>&nbsp;R package is a toolbox for dealing with missing values, in particular for performing&nbsp;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Imputation_(statistics)\" target=\"_blank\" rel=\"noreferrer noopener\">imputation<\/a> of the missing data. Getting a global visual overview of the missing data in the test and training sets is as simple as this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">library(Amelia)\n#... 
code for loading test and train data in a data frame\nmissmap(rawdata, main = \"Missingness Map Train\")\nmissmap(test, main = \"Missingness Map Test\")<\/pre>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1001\" height=\"1024\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/image-7.png?resize=1001%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-1818\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/image-7.png?resize=1001%2C1024&amp;ssl=1 1001w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/image-7.png?resize=293%2C300&amp;ssl=1 293w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/image-7.png?resize=768%2C785&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/image-7.png?resize=1502%2C1536&amp;ssl=1 1502w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/image-7.png?w=1582&amp;ssl=1 1582w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><figcaption class=\"wp-element-caption\">Missingness Maps<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">From those maps, you can immediately observe that only the <em><strong>age<\/strong><\/em> feature suffers badly from missing data. Considering how small the training set is, you can hardly just ignore the records that have a missing age. 
We&#8217;ll see later in the post what kind of strategy we can use to deal with that issue.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Building\/Tuning models with Caret<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The <a href=\"http:\/\/cran.r-project.org\/web\/packages\/caret\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">caret<\/a> package is a kind of toolbox that homogenises the many existing R packages for classification and regression and also provides, out of the box, a standard way to perform common tasks such as model parameter tuning. Also, the author (<a href=\"http:\/\/www.linkedin.com\/pub\/max-kuhn\/10\/a91\/864\" target=\"_blank\" rel=\"noreferrer noopener\">Max Kuhn<\/a>) did an amazing job at documenting the package in the vignettes (<a href=\"http:\/\/cran.r-project.org\/web\/packages\/caret\/vignettes\/caret.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a> or <a href=\"http:\/\/www.jstatsoft.org\/v28\/i05\/paper\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a> for a longer but older version) and on the <a href=\"http:\/\/caret.r-forge.r-project.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">package&#8217;s dedicated website<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is a snippet of code where I successively train a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Random_forest\" target=\"_blank\" rel=\"noreferrer noopener\">random forest<\/a> and a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Gradient_boosting\" target=\"_blank\" rel=\"noreferrer noopener\">gradient boosting<\/a> machine (GBM) using the same train function from caret.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>library(caret)\n\n# Random forest: train() uses method = \"rf\" by default\nforest.model1 &lt;- train(survived ~ pclass + sex + title + sibsp + parch, data.train, importance = TRUE)\n\nfitControl &lt;- trainControl(\n  ## 10-fold CV\n  method = \"repeatedcv\", number = 10,\n  ## repeated ten times\n  repeats = 10)\n\n# GBM, tuned with the repeated cross-validation defined above\ngbm.model2 &lt;- train(survived ~ pclass + sex + title + sibsp + parch, data.train, distribution 
= \"gaussian\", method = \"gbm\", trControl = fitControl, verbose = FALSE)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We&#8217;ll discuss the features used in the formula later, but note the fitControl parameter passed in the call that trains the GBM. This parameter lets you define completely how the model parameters will be tuned. In this example, the candidate values of the GBM parameters (namely <em>interaction.depth<\/em>, <em>n.trees<\/em> and <em>shrinkage<\/em>, see output below) were compared using a repeated 10-fold cross validation with accuracy as the comparison metric, but everything is tuneable for that purpose (you can even pass a grid of specific values to compare for each model parameter).<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"983\" height=\"1024\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-04-at-13.27.52.png?resize=983%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-1830\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-04-at-13.27.52.png?resize=983%2C1024&amp;ssl=1 983w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-04-at-13.27.52.png?resize=288%2C300&amp;ssl=1 288w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-04-at-13.27.52.png?resize=768%2C800&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-04-at-13.27.52.png?w=1158&amp;ssl=1 1158w\" sizes=\"auto, (max-width: 983px) 100vw, 983px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Also, you can easily visualize variable importance (to get it, you need to specify importance=TRUE in the train function, as we did&nbsp;for the random forest):<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter 
size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"698\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/varimp-1024x698-1.png?resize=1024%2C698&#038;ssl=1\" alt=\"\" class=\"wp-image-1820\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/varimp-1024x698-1.png?w=1024&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/varimp-1024x698-1.png?resize=300%2C204&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/varimp-1024x698-1.png?resize=768%2C524&amp;ssl=1 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><figcaption class=\"wp-element-caption\">Variable Importance<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">You can observe that the variable value with the most importance is the title Mr. The interesting part is that the feature &#8220;title&#8221; was not initially in the data set and was artificially created (we&#8217;ll say a bit more about it later in the post). Overall, caret offers a very nice framework for easy model comparison and tuning with proper\/uniform built-in cross-validation routines.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One thing, though, is very true and is put perfectly in this must-watch <a href=\"http:\/\/www.youtube.com\/watch?v=bL4b1sGnILU\" target=\"_blank\" rel=\"noreferrer noopener\">killer talk<\/a>: &#8220;Don&#8217;t get stuck in algorithm land! Focus on putting better data in the algorithm&#8221;. We&#8217;ll see an example illustrating that later in the post.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Pick the best threshold for your classifier using ROC curves<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most classifiers output the probability of an example belonging to a specific class (here &#8216;survived&#8217; or &#8216;died&#8217;). 
When the only goal is to optimise accuracy (as is usually the case in competitions), it is useful to pick the optimal threshold\/cutoff for assigning one class or the other.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">ROC curves can be used for that and also to assess the robustness of your model. If you&#8217;ve never heard about ROC curves, <a href=\"http:\/\/horicky.blogspot.co.il\/2011\/03\/compare-machine-learning-models-with.html\" target=\"_blank\" rel=\"noreferrer noopener\">this article<\/a>&nbsp;gives the basic intuition and <a href=\"http:\/\/people.inf.elte.hu\/kiss\/13dwhdm\/roc.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">that paper<\/a>&nbsp;goes into much more detail while still being crystal clear (I warmly recommend the latter if you&#8217;re interested in the subject). For a very clear standalone example in R, this <a href=\"http:\/\/mkseo.pe.kr\/stats\/?p=790\" target=\"_blank\" rel=\"noreferrer noopener\">post<\/a> is what you need (the code below is inspired by it).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<a href=\"http:\/\/cran.r-project.org\/web\/packages\/pROC\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">pROC<\/a>&nbsp;package makes it easy to analyse and display ROC curves. 
Here, we&#8217;re interested in the threshold corresponding to the point of the curve closest to the top-left corner, which gives the best trade-off between <a href=\"http:\/\/en.wikipedia.org\/wiki\/Sensitivity_and_specificity\" target=\"_blank\" rel=\"noreferrer noopener\">sensitivity and specificity<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">#code inspired by http:\/\/mkseo.pe.kr\/stats\/?p=790\nlibrary(pROC)\n\n# class probabilities predicted on the held-out data\nresult.predicted.prob.model1 &lt;- predict(forest.model1, data.test, type=\"prob\")\nresult.roc.model1 &lt;-  roc(data.test$survived, result.predicted.prob.model1$yes)\nplot(result.roc.model1, print.thres=\"best\", print.thres.best.method=\"closest.topleft\")\n\n# threshold closest to the top-left corner, with its accuracy\nresult.coords.model1 &lt;- coords(  result.roc.model1, \"best\", best.method=\"closest.topleft\",\n                          ret=c(\"threshold\", \"accuracy\"))\nresult.coords.model1<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This outputs both a graph:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"979\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-03-at-13.48.41.png?resize=1024%2C979&#038;ssl=1\" alt=\"\" class=\"wp-image-1824\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-03-at-13.48.41.png?resize=1024%2C979&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-03-at-13.48.41.png?resize=300%2C287&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-03-at-13.48.41.png?resize=768%2C734&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-03-at-13.48.41.png?w=1270&amp;ssl=1 1270w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><figcaption class=\"wp-element-caption\">ROC curve (click for higher 
quality)<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">and high-level information about the curve, e.g.:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Call:\nroc.default(response = data.test$survived, predictor = result.predicted.prob.model1$yes)\n\nData: result.predicted.prob.model1$yes in 78 controls (data.test$survived yes) &gt; 65 cases (data.test$survived no).\nArea under the curve: 0.931<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Note in particular the area under the curve (a.k.a. AUC), which is <a href=\"http:\/\/www.kaggle.com\/c\/whale-detection-challenge\/details\/evaluation\" target=\"_blank\" rel=\"noreferrer noopener\">sometimes<\/a> used to assess the robustness\/quality of a model, although it has been questioned a lot and is often criticised for not being a precise\/useful measure of classification performance (a short discussion of it can be found <a href=\"http:\/\/riceanalytics.com\/db3\/00232\/riceanalytics.com\/_download\/Is%20the%20AUC%20the%20Best%20Measure.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>). In other words, you&#8217;re often better off relying on your K-fold&nbsp;cross validation measures to assess your out-of-sample performance (cf. the previous section on caret).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Tweaks and tricks<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I hinted earlier that the number of missing ages was too high and the training set too small to just ignore the records with a missing age. At least for me, no attempt to impute the missing ages (whether naive or more sophisticated) led to any significant accuracy improvement on the 10-fold cross validation test.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It turns out that extracting the title (e.g. Mr, Mrs, 
etc.) from the Name attribute of the data set did lead to an improvement (on the competition&#8217;s&nbsp;<a href=\"http:\/\/www.kaggle.com\/c\/titanic-gettingStarted\/forums\/\" target=\"_blank\" rel=\"noreferrer noopener\">forums<\/a>, I saw that a few other people used that feature as well). Let&#8217;s have a look at the age distributions per extracted title in the training set (some rare titles were aggregated into more common ones, e.g. &#8220;Capt&#8221;, &#8220;Col&#8221;, &#8220;Major&#8221;, &#8220;Sir&#8221;, &#8220;Don&#8221; and &#8220;Dr&#8221; were mapped to &#8220;Mr&#8221;):<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"923\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-03-at-13.49.36.png?resize=1024%2C923&#038;ssl=1\" alt=\"\" class=\"wp-image-1825\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-03-at-13.49.36.png?resize=1024%2C923&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-03-at-13.49.36.png?resize=300%2C271&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-03-at-13.49.36.png?resize=768%2C693&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/Screenshot-2025-07-03-at-13.49.36.png?w=1160&amp;ssl=1 1160w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><figcaption class=\"wp-element-caption\">Age distributions per Title<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This broadly matches intuition (though I didn&#8217;t know that, apparently, in old\/traditional English &#8220;Master&#8221; denotes a young\/unmarried man). 
It also makes intuitive sense that Title is a good proxy for age, whose values are too often missing: it lets us ignore the age feature entirely and thus keep all the data in the training set, without introducing potential noise through an imputation method.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When I plugged this new Title feature into the random forest, I saw an improvement from 0.785 to 0.801 on my 10-fold cross validation out-of-sample accuracy estimate, and it was reflected in my submission on the public leaderboard, where I jumped into the top 5% of submissions at that time.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"362\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/improve1-1024x362-1.png?resize=1024%2C362&#038;ssl=1\" alt=\"\" class=\"wp-image-1826\" title=\"improve\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/improve1-1024x362-1.png?w=1024&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/improve1-1024x362-1.png?resize=300%2C106&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/improve1-1024x362-1.png?resize=768%2C272&amp;ssl=1 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Note that an improvement on your cross validation is not always reflected on the leaderboard; sometimes the opposite even happens (cf. &#8220;Lesson One&#8221; from this very cool <a href=\"http:\/\/www.rouli.net\/2013\/02\/five-lessons-from-kaggles-event.html\" target=\"_blank\" rel=\"noreferrer noopener\">blog post<\/a> by <a href=\"https:\/\/twitter.com\/rouli\" target=\"_blank\" rel=\"noreferrer noopener\">@rouli<\/a>, highly recommended). 
Note also that this particular competition lasts one year and is just for learning purposes, so there are thousands of participants, including quite a few people who obviously wasted time extracting the answers from publicly available lists (e.g. <a href=\"http:\/\/lib.stat.cmu.edu\/S\/Harrell\/data\/ascii\/titanic.txt\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a> or&nbsp;<a href=\"http:\/\/titanic100.com\/titanic-passengers-the-list-of-people-on-the-titanic-when-it-sank\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>) to get a near-perfect score (you could use those lists to estimate your real final score on the private leaderboard if you can&#8217;t wait for the end of the competition, but that is still kind of pointless). Finally, more can be done to try to improve the accuracy further, an obvious option being to combine multiple models (majority vote is often used in binary\/multi-class settings), but we won&#8217;t cover that in this post.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Conclusion<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We explored, through a comprehensive example, how R can be used to quickly build and tune robust predictive models that significantly outperform&nbsp;the baseline. Of course, it is somewhat of a toy example, but it was interesting enough to explore some important aspects of building predictive models. For much bigger data sets (in terms of training set size and\/or number of features) you might need to introduce different\/additional technical and theoretical tools that we might explore in future posts.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Also, note that a competition setting can be very different from a real production setting. 
I&#8217;m not only talking about why <a href=\"http:\/\/www.techdirt.com\/blog\/innovation\/articles\/20120409\/03412518422\/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml\" target=\"_blank\" rel=\"noreferrer noopener\">Netflix never implemented the model that won the $1M challenge<\/a>, but also about the whole infrastructure that you&#8217;d need to build in order to do big data science at scale on many different problems (Scala is quickly becoming a trend around that, check those <a href=\"http:\/\/www.slideshare.net\/VitalyGordon\/scalable-and-flexible-machine-learning-with-scala-linkedin\" target=\"_blank\" rel=\"noreferrer noopener\">killer slides<\/a>&nbsp;and <a href=\"http:\/\/www.youtube.com\/watch?v=VTn2-VZYpoQ\" target=\"_blank\" rel=\"noreferrer noopener\">talk<\/a> by my friend <a href=\"https:\/\/twitter.com\/BigDataSc\" target=\"_blank\" rel=\"noreferrer noopener\">@BigDataSc<\/a>&nbsp;from LinkedIn and <a href=\"https:\/\/twitter.com\/ccsevers\" target=\"_blank\" rel=\"noreferrer noopener\">@ccsevers<\/a>&nbsp;from eBay for more on that).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I&#8217;ll conclude by quoting once more this awesome sentence from this <a href=\"http:\/\/www.youtube.com\/watch?v=bL4b1sGnILU\" target=\"_blank\" rel=\"noreferrer noopener\">must-watch talk<\/a> by <a href=\"https:\/\/twitter.com\/nmkridler\" target=\"_blank\" rel=\"noreferrer noopener\">@nmkridler<\/a>: &#8220;Don&#8217;t get stuck in algorithm land! Focus on putting better data in the algorithm&#8221;. 
I really think that this is what data science is all about.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>References \/ Useful Links<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">Full code of the plots\/models shown in this post: on <a href=\"https:\/\/github.com\/padjiman\/titanic-blog-post-code\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub<\/a> and <a href=\"http:\/\/rpubs.com\/padjiman\/8419\" target=\"_blank\" rel=\"noreferrer noopener\">RPubs<\/a><\/li>\n\n\n\n<li class=\"\"><a href=\"http:\/\/mattdelhey.com\/blog\/kaggle-getting-started-with-excel-in-r\/\" target=\"_blank\" rel=\"noreferrer noopener\">Kaggle: Getting Started with Excel &#8212; In R<\/a>. Very nice R conversion of Kaggle&#8217;s initial exploratory analysis of the data set.<\/li>\n\n\n\n<li class=\"\"><a href=\"http:\/\/people.inf.elte.hu\/kiss\/13dwhdm\/roc.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">An introduction to ROC analysis<\/a>. Crystal clear primer if you want to know more about ROC.<\/li>\n\n\n\n<li class=\"\"><a href=\"http:\/\/www.youtube.com\/watch?v=bL4b1sGnILU\" target=\"_blank\" rel=\"noreferrer noopener\">Data Agnosticism: Feature Engineering Without Domain Expertise<\/a>. Must-watch talk if you&#8217;re a Kaggler (by&nbsp;<a href=\"https:\/\/twitter.com\/nmkridler\" target=\"_blank\" rel=\"noreferrer noopener\">@nmkridler<\/a>).<\/li>\n\n\n\n<li class=\"\"><a href=\"http:\/\/www.rouli.net\/2013\/02\/five-lessons-from-kaggles-event.html\" target=\"_blank\" rel=\"noreferrer noopener\">Five Lessons from Kaggle&#8217;s Event Recommendation Engine Challenge<\/a>. 
Same comment as above (by&nbsp;<a href=\"https:\/\/twitter.com\/rouli\" target=\"_blank\" rel=\"noreferrer noopener\">@rouli<\/a>&nbsp;).<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Step aboard the Titanic dataset: Explore, feature-engineer, and model your way to survival predictions with style.<\/p>\n","protected":false},"author":1,"featured_media":1811,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[7,12,15,17],"tags":[21,29,33],"class_list":["post-1095","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-machine-learning","category-r","category-tutorial","tag-data-science","tag-machine-learning","tag-r"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2013\/09\/0_qkt5M5Lhyb0oRie4.webp?fit=567%2C328&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/1095","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/comments?post=1095"}],"version-history":[{"count":18,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/1095\/revisions"}],"predecessor-version":[{"id":1958,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/1095\/revisions\/1958"}],"wp:featuredmedia":[{"embeddable":true,
"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/1811"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=1095"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/categories?post=1095"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=1095"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}