{"id":1215,"date":"2018-04-03T07:05:45","date_gmt":"2018-04-03T07:05:45","guid":{"rendered":"http:\/\/www.philippeadjiman.com\/blog\/?p=1215"},"modified":"2025-11-11T06:32:48","modified_gmt":"2025-11-11T06:32:48","slug":"deep-dive-into-logistic-regression-part-3","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2018\/04\/03\/deep-dive-into-logistic-regression-part-3\/","title":{"rendered":"Deep Dive Into Logistic Regression: Part 3"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"512\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2018\/04\/Vowpal-Wabbit.png?resize=1024%2C512&#038;ssl=1\" alt=\"\" class=\"wp-image-1925\" style=\"width:498px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2018\/04\/Vowpal-Wabbit.png?resize=1024%2C512&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2018\/04\/Vowpal-Wabbit.png?resize=300%2C150&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2018\/04\/Vowpal-Wabbit.png?resize=768%2C384&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2018\/04\/Vowpal-Wabbit.png?resize=1536%2C768&amp;ssl=1 1536w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2018\/04\/Vowpal-Wabbit.png?w=1600&amp;ssl=1 1600w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n<p>In <a href=\"http:\/\/www.philippeadjiman.com\/blog\/2017\/12\/09\/deep-dive-into-logistic-regression-part-1\/\" target=\"_blank\" rel=\"noopener\">part 1<\/a> and <a href=\"http:\/\/www.philippeadjiman.com\/blog\/2018\/02\/26\/deep-dive-into-logistic-regression-part-2\/\" target=\"_blank\" rel=\"noopener\">part 2<\/a>\u00a0of this series, we set\u00a0both the theoretical and practical foundation of logistic 
regression and saw how a state of the art implementation fits in roughly 30 lines of code. In this third (and last) post of this series, we&#8217;ll demonstrate the use of a very effective and powerful library to build logistic regression models in practice: Vowpal Wabbit.<\/p>\n<h2>What is Vowpal Wabbit<\/h2>\n<p><a href=\"https:\/\/github.com\/JohnLangford\/vowpal_wabbit\/wiki\">Vowpal Wabbit<\/a>\u00a0(VW) is a general purpose machine learning library which implements, among other things, logistic regression with the same ideas we presented in our <a href=\"http:\/\/www.philippeadjiman.com\/blog\/2018\/02\/26\/deep-dive-into-logistic-regression-part-2\/\">previous post<\/a>, like the hashing trick and per-coordinate adaptive learning rates (in fact, the hashing trick was made popular by that library). A big advantage of Vowpal Wabbit is that it is blazing fast, not only because its underlying implementation is in C++, but also because its default learner is an online (stochastic) gradient descent that streams over the examples one at a time. VW can also use the L-BFGS optimization method (via the <code>--bfgs<\/code> option).\u00a0L-BFGS stands for &#8220;Limited-memory Broyden\u2013Fletcher\u2013Goldfarb\u2013Shanno&#8221; and basically approximates the Broyden\u2013Fletcher\u2013Goldfarb\u2013Shanno (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm\">BFGS<\/a>) method using a limited amount of memory.\u00a0 This method is much more complex to implement than stochastic gradient descent (which can be implemented in a few lines of code as we saw in our previous post), but is supposed to converge faster (in fewer iterations). If you want to read more about L-BFGS and\/or understand how it differs from other optimisation methods, you can check\u00a0<a href=\"https:\/\/github.com\/JohnLangford\/vowpal_wabbit\/wiki\/L-BFGS.pdf\" target=\"_blank\" rel=\"noopener\">this<\/a>\u00a0(doc from Vowpal Wabbit) or <a href=\"http:\/\/aria42.com\/blog\/2014\/12\/understanding-lbfgs\">this<\/a>\u00a0(nice blog post). 
Note that L-BFGS was empirically observed to be superior to SGD in a number of cases, including some deep learning settings (check out that <a href=\"http:\/\/ai.stanford.edu\/~quocle\/LeNgiCoaLahProNg11.pdf\">paper<\/a>\u00a0on that topic).<\/p>\n<h2>Input format, Namespaces and more<\/h2>\n<p>Many times, i&#8217;ve heard of people giving up on Vowpal Wabbit because of its input format, even after going quickly over its\u00a0<a href=\"https:\/\/github.com\/JohnLangford\/vowpal_wabbit\/wiki\/Input-format\">documentation<\/a>. So let&#8217;s try to present it through a toy (yet real) example that will be used throughout this post to illustrate the main concepts of Vowpal Wabbit. On top of it, i&#8217;ll provide a helper tool (in the next section) to easily transform your tabular dataset into the VW input format.<\/p>\n<p>So, the dataset we&#8217;ll use can be found <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Bank+Marketing\">here<\/a>\u00a0and represents the attempt of a bank to predict if a marketing phone call will end up in a bank term deposit by the customer, based on a bunch of signals such as socio-economic factors of the customer (e.g. &#8220;does he have a loan?&#8221;), etc.<\/p>\n<p>The traditional way to represent such datasets is to have a tsv or csv file, with the header being the name of the signals and each line representing the value of the training example on each signal. Each line of the training set thus has a fixed size, and a missing value is just a blank cell or some specific value indicating that it&#8217;s missing. Typically, for that dataset, the header looks like this:<\/p>\n<p><span style=\"font-family: 'courier new', courier, monospace;\">age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y<\/span><\/p>\n<p>With y being the actual supervision (i.e. did the call end up in a bank term deposit). 
And a typical training example looks like that:<\/p>\n<p class=\"p1\"><span class=\"s1\" style=\"font-family: 'courier new', courier, monospace;\">58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no<\/span><\/p>\n<p>In Vowpal Wabbit, there is no header, and each signal name is embedded in the training example itself. For example, the training example above can look like that in Vowpal Wabbit format:<\/p>\n<p class=\"p1\"><span class=\"s1\" style=\"font-family: 'courier new', courier, monospace;\">-1 |i age:58 balance:2143 duration:261 campaign:1 pdays:-1 previous:0 |c job=management marital=married education=tertiary default=no housing=yes contact=unknown day=5 month=may poutcome=unknown<\/span><\/p>\n<p>Let&#8217;s discuss multiple important things there:<\/p>\n<ul>\n<li>-1 says that this was a negative example.<\/li>\n<li>The <span style=\"font-family: 'courier new', courier, monospace;\">|i<\/span> and <span style=\"font-family: 'courier new', courier, monospace;\">|c<\/span>\u00a0 are here to specify that the following features are part of a same feature namespace.\u00a0 Being part of a namespace simply means that all the features in the namespace will be hashed together in a same feature space (this relates to the hashing trick, c.f. the <a href=\"http:\/\/www.philippeadjiman.com\/blog\/2018\/02\/26\/deep-dive-into-logistic-regression-part-2\/\">previous post<\/a> of that series).<\/li>\n<li>Here, i artificially created two namespaces: one for numerical features and another one for categorical ones. 
But that was just to illustrate the idea of namespace .<\/li>\n<li>In practice, namespaces can be used for different reasons (check the doc <a href=\"https:\/\/github.com\/JohnLangford\/vowpal_wabbit\/wiki\/Name-Spaces\">here<\/a>) but one that is particularly useful\u00a0 is that it allows you to do feature interactions:<\/li>\n<li>For instance, in the command line, using\u00a0<code>--quadratic ic<\/code>\u00a0would combine all the features of the namespaces <span style=\"font-family: 'courier new', courier, monospace;\">i<\/span> and <span style=\"font-family: 'courier new', courier, monospace;\">c<\/span> in our example above to create\u00a0on the fly 2-way interacting features .\u00a0 For instance the value of age and job together would be a new signal (maybe if you are a certain age in a certain profession, you&#8217;re more or less likely to do a bank term deposit).<\/li>\n<li>Note as well that for the numerical features, i used the colon &#8216;<span style=\"font-family: 'courier new', courier, monospace;\">:<\/span>&#8216; and for categorical ones i used &#8216;<span style=\"font-family: 'courier new', courier, monospace;\">=<\/span>&#8216; .<\/li>\n<li>Only the\u00a0 &#8216;<span style=\"font-family: 'courier new', courier, monospace;\">:<\/span>&#8216; will be interpreted by Vowpal Wabbit. Both in training and when applying the model, the weight of the corresponding numerical feature (let&#8217;s say age) will be multiplied by the actual numerical value in the weighted linear product of the logistic hypothesis (more on that later).<\/li>\n<li>The\u00a0\u00a0&#8216;<span style=\"font-family: 'courier new', courier, monospace;\">=<\/span>&#8216; is just cosmetic and for clarity. 
Technically, writing\u00a0\u00a0<span class=\"s1\" style=\"font-family: 'courier new', courier, monospace;\">married\u00a0<\/span>instead of\u00a0<span class=\"s1\" style=\"font-family: 'courier new', courier, monospace;\">marital=married <\/span>makes absolutely no difference for the training, <strong>except<\/strong> if the value\u00a0\u00a0<span class=\"s1\" style=\"font-family: 'courier new', courier, monospace;\">married<\/span>\u00a0could show up in different contexts. E.g. if there were another signal <span style=\"font-family: 'courier new', courier, monospace;\">childMarital<\/span>\u00a0indicating the marital status of the customer&#8217;s children,\u00a0 then you&#8217;d have to differentiate whether the value\u00a0<span class=\"s1\" style=\"font-family: 'courier new', courier, monospace;\">married<\/span>\u00a0refers to the customer or his children, in which case the feature name would be necessary. Note that if you put two such features in different namespaces then they could not be mixed together and the prefix would again not be necessary.<\/li>\n<li>Note that for each signal, i&#8217;ve used the full name of the signal as a prefix (e.g. <span style=\"font-family: 'courier new', courier, monospace;\">age<\/span> or <span style=\"font-family: 'courier new', courier, monospace;\">marital<\/span>). First, we just saw that for categorical features, this is not necessarily required. For numerical signals though, it is (i.e. you cannot just throw a number without context). Now, for huge training sets, you don&#8217;t necessarily want to have a long string repeated millions of times (or more). A good compromise is to have a mapping between signal names and very short strings (like e.g. F1, F2, F3 &#8230;). 
In the following section, i provide some code that generates such a training set along with the signal names mapping.<\/li>\n<li>There is a nice answer on Stack Overflow <a href=\"https:\/\/stackoverflow.com\/questions\/28640837\/vowpal-wabbit-how-to-represent-categorical-features\">here<\/a>\u00a0giving a short cheat-sheet as a reminder of how to encode boolean, categorical, ordinal+monotonic or numerical variables in VW.<\/li>\n<li>Last but not least, one thing i love about this format is that it is very well suited to sparse data. Say you have thousands of features or maybe just a list of words: you don&#8217;t care about the order of the features or the missing values, you just throw the features with the right prefix and\/or in the right namespace and you&#8217;re done. VW will then hash them in their proper bucket in their proper hashing namespace.<\/li>\n<\/ul>\n<h2>How to transform your TSV\/CSV datasets into VW format<\/h2>\n<p>Most often, classification or regression training sets come in the form of TSV or CSV files as mentioned previously.\u00a0 Transforming them into the VW input format is not difficult, but it does require a minimum of attention. Indeed, depending on the training set, the target variable (or label) might be a word like &#8220;yes\/no&#8221; or a number like &#8220;1\/0&#8221;, while VW requires it to be -1\/1 for the logistic loss. Also, depending on whether a signal is numerical or categorical, you need to encode it differently in VW (using e.g. &#8216;:&#8217; for numerical features, c.f. remarks in the previous section).<\/p>\n<p>I wrote a very simple java (8) class that does this, find it <a href=\"https:\/\/gist.github.com\/padjiman\/f5d28293350b1390d294dc401dc008ed#file-tabular2vwgenerator-java\" target=\"_blank\" rel=\"noopener\">here<\/a> and feel free to use it.\u00a0 You&#8217;ll just need to create (or edit an existing) method there to set up the characteristics of your data set. 
It doesn&#8217;t use any external library (other than pure java 8 libraries) so if you don&#8217;t have a Java IDE already installed,\u00a0you can easily edit it from a text editor and compile\/run it from the command line.<\/p>\n<p>Then you simply specify:<\/p>\n<ul>\n<li>the separator (e.g. &#8216;\\t&#8217; or &#8216;,&#8217; or &#8216;;&#8217;)<\/li>\n<li>the name of the target variable (as it appears in the header)<\/li>\n<li>the value of the positive target variable in the dataset (e.g. &#8216;yes&#8217; or &#8216;1&#8217; or &#8216;click&#8217;)<\/li>\n<li>two separate lists: the list of names of numerical variables and the list of names of categorical variables (the names must be in the header as well).<\/li>\n<\/ul>\n<p>All those are specified as parameters inside a side method that you just invoke in the main (which in turn invokes the core method of that class, called <code>tabular2VWGenerator<\/code>). You have two examples of such side methods in the code:\u00a0 <code>generateBankTraningSet<\/code>\u00a0representing our <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Bank+Marketing\" target=\"_blank\" rel=\"noopener\">dataset<\/a> discussed above\u00a0and\u00a0<code>generateDonationTrainingSet<\/code>\u00a0representing another <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/KDD+Cup+1998+Data\">more complex dataset<\/a> with a lot of sparse features (check the full list <a href=\"https:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/kddcup98-mld\/epsilon_mirror\/cup98dic.txt\" target=\"_blank\" rel=\"noopener\">here<\/a>).\u00a0 You invoke the appropriate method from the main.<\/p>\n<p>The program then parses each line of the original training set, and, based on the lists of numerical\/categorical variable names, generates two files:<\/p>\n<ul>\n<li>the corresponding training examples in VW format (which also takes care of missing values, which are assumed to be empty strings, even though you can change that 
in the code). Feature names from the header are transformed into short names: F0, F1, F2, &#8230; This is to make the training set file smaller (it does make a difference for huge datasets)<\/li>\n<li>A small &#8220;.txt&#8221; file, mapping the short signal names to the original signal names from the header (e.g. F0 corresponds to age).<\/li>\n<\/ul>\n<p>Important note: as in the example in the previous section, the program separates the numerical and categorical features into two namespaces (respectively named <span style=\"font-family: 'courier new', courier, monospace;\">i<\/span> and <span style=\"font-family: 'courier new', courier, monospace;\">c<\/span>).\u00a0 You can also decide to put all the features in the same namespace (c.f. last parameter of the <code>tabular2VWGenerator<\/code> method). For our previous example, a training example will look like this:<\/p>\n<p><span class=\"s1\" style=\"font-family: 'courier new', courier, monospace;\">-1 |i F0:58 F5:2143 F11:261 F12:1 F13:-1 F14:0 |c F1=management F2=married F3=tertiary F4=no F6=yes F8=unknown F9=5 F10=may F15=unknown<\/span><\/p>\n<p>Note that the program can easily be enhanced to e.g. support multiple lists as input, each one representing a namespace, and in each list you could represent the feature type as a character, e.g. one of the lists could look like {&#8220;age:n&#8221;, &#8220;balance:n&#8221;, &#8220;education:c&#8221;} and the program would parse this and know that age is numerical and education is categorical and encode them accordingly. Feel free to modify it!<\/p>\n<h2>The VW command line and its powerful options<\/h2>\n<p>Once you have your training set in the VW input format, you can start playing around with building some models from the command line. To illustrate it, we&#8217;ll take the small dataset we mentioned before about <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Bank+Marketing\">predicting bank term deposit<\/a>. 
You can find <a href=\"http:\/\/ner.jul.mybluehost.me\/wp-content\/uploads\/2018\/03\/bankTrainingSet.zip\">here<\/a> the training set in the VW input format and its short name signal mapping\u00a0(which were created using the tool described in the previous section).<\/p>\n<p>Let&#8217;s start with a first command to train a logistic regression model:<br \/><code><br \/>\n<mark style=\"background-color: #b3b6b7;\">vw train.vw -f model.vw --loss_function logistic<\/mark><br \/>\n<\/code><\/p>\n<p>It is pretty much self-explanatory (<span style=\"font-family: 'courier new', courier, monospace;\">-f<\/span> specifies the filename of the output model and\u00a0<span style=\"font-family: 'courier new', courier, monospace;\">--loss_function<\/span> specifies which loss function to use, logistic in our case).<\/p>\n<p>The output will show you some useful information on the progress of the training, along with the final obtained loss (<span style=\"font-family: 'courier new', courier, monospace;\">average loss = 0.253874<\/span> in that case).<\/p>\n<p>Then, to actually use the model on a separate test set (more later on how to easily create one), you simply do:<br \/><code><br \/>\n<mark style=\"background-color: #b3b6b7;\">vw test.vw -t -i model.vw -p preds.txt --link logistic<\/mark><br \/>\n<\/code><\/p>\n<p>The <code>-t<\/code>\u00a0option specifies that you&#8217;re in test mode and VW will thus ignore the labels of the examples. <code>-i<\/code> specifies the model to use (typically the one that was created by the previous training command).\u00a0<code>--link logistic<\/code>\u00a0says that the logistic (sigmoid) function is applied on top of the linear combination. 
Without it, the file preds.txt will contain only the result of the linear combination of the weights and the feature values, &sum;<sub>i<\/sub> w<sub>i<\/sub>x<sub>i<\/sub>, and not the sigmoid function applied on top of it.<\/p>\n<p>Some options i found useful and interesting for the training part:<\/p>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li><code> -c --passes N<\/code>\u00a0.\u00a0 This specifies to do N passes on the training set while learning the optimal weights. In deep learning, the term epoch is often used instead of pass, and basically represents a full pass over the whole training set to update the weights. Doing several passes often leads to stronger models but the ideal number of passes should be tuned as a hyperparameter.\u00a0 Note that the\u00a0 <span style=\"font-family: 'courier new', courier, monospace;\">-c<\/span>\u00a0option, which enables caching, is necessary when doing multiple passes because from the second pass on, VW is using pre-compiled information that it prepared\/cached during the first pass.<\/li>\n<li><code> -b N <\/code>\u00a0. The <code>-b<\/code> option allows you to control the number of bits N used for the hash space (c.f.\u00a0<a href=\"http:\/\/www.philippeadjiman.com\/blog\/2018\/02\/26\/deep-dive-into-logistic-regression-part-2\/\" target=\"_blank\" rel=\"noopener\">part 2<\/a> of this series to understand what the hashing trick is), i.e. to set its size to 2<sup>N<\/sup> entries. The default value for N is 18, which might be more than ok (e.g. for the toy bank dataset) or not enough depending on the cardinality of your feature values. If you need to encode features having a high cardinality, i.e. a lot of different values like e.g. a product id in a catalog of millions of products, or, more frequently, if you need to create interactions of features (i.e. the cartesian product of two features&#8217; values), which also often leads to high cardinality features, then you&#8217;ll probably need to increase N. 
Obviously the higher it is, the fewer collisions you&#8217;ll have in your hash space, but the more memory you&#8217;ll need.<\/li>\n<li><code>--interactions arg<\/code>\u00a0. This is a very powerful one. Basically\u00a0 <code>arg<\/code> is a list of letters, and each letter represents a namespace (assuming you organised your features around namespaces, like e.g. in our example in the previous section). Applying that option means that it will automatically create interactions between all features in the corresponding namespaces. For instance, in our example above, adding e.g.\u00a0<code> --interactions ic <\/code>\u00a0 will instantly create a whole bunch of new features in the model: all the interaction pairs between features in the namespace i and in the namespace c . Note that in this case the option is equivalent to <code> --quadratic ic<\/code> but the <code> --interactions<\/code> option is more general as it can create not only quadratic interactions but also higher-order ones (triplets, quadruplets etc&#8230;). Such a feature somehow allows you to get closer to factorization machine models.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>So here is an example of a training command using those parameters:<\/p>\n<p><code><mark style=\"background-color: #b3b6b7;\">vw train.vw -c --passes 4 -f model.vw --loss_function logistic --interactions ci -b 26<\/mark><\/code><\/p>\n<h2>Using VW from a python Jupyter Notebook<\/h2>\n<p>A lot of ML engineers\/data scientists nowadays (including myself) are using jupyter notebooks to explore\/play with\/compare various models interactively right from the notebook, thanks to the huge ML ecosystem we have in python (scikit-learn, keras, etc&#8230;). While the VW command line is nice, i still wanted to be able to play with it from a notebook, to easily control the train\/test split, graph the results, switch between datasets, compare to other algos\/libraries etc&#8230;<\/p>\n<p>There are some python wrappers for VW (e.g. 
<a href=\"https:\/\/github.com\/josephreisinger\/vowpal_porpoise\">here<\/a>) but they are either painful to install or slower. So i used a\u00a0 less clean yet very practical solution: calling VW as an external command from the notebook and loading the results of the training via the output file. See a full example below. Feel free to run it in your own notebook, you&#8217;ll only need to specify the right path and have a training set in the VW format. Here again i used the banking training set in VW format (re-sharing the link <a href=\"http:\/\/ner.jul.mybluehost.me\/wp-content\/uploads\/2018\/03\/bankTrainingSet.zip\">here<\/a>) that was generated by the tool i presented previously.<\/p>\n<p><script src=\"https:\/\/gist.github.com\/padjiman\/5f82a1b1559b11f36a706c4b04e5ab59.js\"><\/script>
<\/p>\n<p>Once you&#8217;re in the python ecosystem, you can feel at home, use any of the libraries you&#8217;re familiar with, e.g. calculating the AUC as we did above, or e.g plotting the ROC curve (as a continuity of previous notebook, see below). Bottom line: the sky (or maybe your python skills) is the limit :).<\/p>\n<p><script src=\"https:\/\/gist.github.com\/padjiman\/d2ece81710a5ba11135e1e83a18b441b.js\"><\/script><\/p>\n<p>Note that an AUC of 0.91 on the test set is very respectable. Well, we used a rather simple dataset here, mainly to illustrate the concepts more easily, but i played with much less trivial datasets as well, having hundreds of sparse features and hundreds of gigabytes of data, and VW in most cases eats them for breakfast and gives very strong results.<\/p>\n<h2>Auditing the weights of your model<\/h2>\n<p>When you want to debug your model to check if something is wrong, VW proposes a very nice auditing option :\u00a0<code> -a<\/code>\u00a0. It also allows to explore how VW is representing the core info of your model behind the scene. Let&#8217;s use that option with the following command:<\/p>\n<p><code> <mark style=\"background-color: #b3b6b7;\"> vw -d train.vw -f model.vw --loss_function logistic --link=logistic -p probs.txt  -a &gt; weights_details.txt <\/mark> <\/code><\/p>\n<p>Note that the <code> -a<\/code> option is not working if you also have the\u00a0<code>--interactions<\/code>\u00a0option in the same command. 
If you open the weights_details.txt file, a typical line will look like this:<\/p>\n<p class=\"p1\"><span class=\"s1\">-2.546350<\/span><\/p>\n<p class=\"p1\"><span class=\"s1\"><span class=\"Apple-converted-space\">\u00a0 \u00a0 \u00a0 \u00a0<\/span>i^F11:137987:380:-0.000686432@20293.9 <span class=\"Apple-converted-space\">\u00a0 <\/span>c^F15=unknown:86264:1:-0.241524@0.414363<span class=\"Apple-converted-space\">\u00a0 \u00a0 \u00a0 \u00a0 <\/span>Constant:116060:1:-0.241524@0.414363<span class=\"Apple-converted-space\">\u00a0 \u00a0 <\/span>i^F12:217054:1:-0.241524@0.414363 <span class=\"Apple-converted-space\">\u00a0 \u00a0 \u00a0 <\/span>i^F13:200603:-1:0.241524@0.414363 <span class=\"Apple-converted-space\">\u00a0 \u00a0 \u00a0 <\/span>c^F10=may:104323:1:-0.241524@0.414363 <span class=\"Apple-converted-space\">\u00a0 <\/span>c^F9=5:218926:1:-0.241524@0.414363<span class=\"Apple-converted-space\">\u00a0 \u00a0 \u00a0 <\/span>c^F8=unknown:86079:1:-0.241524@0.414363 c^F6=yes:6939:1:-0.220727@0.39844 <span class=\"Apple-converted-space\">\u00a0 \u00a0 \u00a0 <\/span>i^F0:48942:42:-0.00340787@1112.67 <span class=\"Apple-converted-space\">\u00a0 \u00a0 \u00a0 <\/span>c^F3=tertiary:235513:1:-0.121834@0.26117<span class=\"Apple-converted-space\">\u00a0 \u00a0 \u00a0 \u00a0 <\/span>c^F1=entrepreneur:69649:1:-0.10903@0.03325\u00a0<span class=\"Apple-converted-space\">\u00a0 <\/span>i^F5:165402:2:-5.23114e-05@1.19111e+06<span class=\"Apple-converted-space\">\u00a0 <\/span>c^F4=yes:211075:1:0@0 <span class=\"Apple-converted-space\">\u00a0 <\/span>c^F2=divorced:209622:1:0@0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In the auditing section of <\/span><a href=\"https:\/\/github.com\/JohnLangford\/vowpal_wabbit\/wiki\/Tutorial\"><span style=\"font-weight: 400;\">that page<\/span><\/a><span style=\"font-weight: 400;\">, you have the details of each piece of this format, but\u00a0\u00a0<span class=\"s1\"> let&#8217;s analyze one piece of it together, e.g. 
c^F1=entrepreneur:69649:1:-0.10903@0.03325\u00a0:<\/span><\/span><\/p>\n<ul>\n<li>c^ means that the signal is part of the c namespace (this is the categorical namespace, c.f. previous section)<\/li>\n<li><span style=\"font-weight: 400;\"><span class=\"s1\">F1=entrepreneur .\u00a0 This is the actual feature value in the format we built, with F1 being the name of the feature (which corresponds to job in our dataset, c.f. previous section)<\/span><\/span><\/li>\n<li>69649 is the index of the feature in the hash space, i.e. the result of applying the hash function on the string &#8220;<span style=\"font-weight: 400;\"><span class=\"s1\">F1=entrepreneur<\/span><\/span>&#8221; within the namespace c . Note we didn&#8217;t use the\u00a0<code>-b<\/code>\u00a0option, thus the hash space has its default size of 2^18 , which is\u00a0262144 , and the weight of the feature\u00a0F1=entrepreneur is stored there at index\u00a069649 .<\/li>\n<li>1 is the value of the feature. For a numerical feature it will be a number, but for categorical values (like here) it is 1 by default.<\/li>\n<li><span style=\"font-weight: 400;\"><span class=\"s1\">-0.10903 is the actual weight of the feature<\/span><\/span><\/li>\n<li><span style=\"font-weight: 400;\"><span class=\"s1\">0.03325 is the sum of gradients squared for that feature. This is used for the per-coordinate adaptive learning rate (see <a href=\"http:\/\/www.philippeadjiman.com\/blog\/2018\/02\/26\/deep-dive-into-logistic-regression-part-2\/\" target=\"_blank\" rel=\"noopener\">part 2<\/a> of that series for the intuition behind it).<\/span><\/span><\/li>\n<\/ul>\n<p>You now might ask, where does the number\u00a0<span class=\"s1\">-2.546350 at the beginning of the line come from? This actually represents the weighted linear sum for that example, i.e. &sum;<sub>i<\/sub> w<sub>i<\/sub>x<sub>i<\/sub>, each weight multiplied by the value of its feature (c.f. <a href=\"http:\/\/www.philippeadjiman.com\/blog\/2017\/12\/09\/deep-dive-into-logistic-regression-part-1\/\" target=\"_blank\" rel=\"noopener\">part 1<\/a> of this series) . 
A bit tedious, but to convince yourself, you can redo the actual calculation from the above example (value times weight for each feature):<\/span><\/p>\n<p>380*-0.000686432 -0.241524 -0.241524 -0.241524 -1*0.241524 -0.241524 -0.241524 -0.241524 -0.220727 +42*-0.00340787 -0.121834 -0.10903 + 2*-0.0000523114<\/p>\n<p>This gives the output\u00a0<span class=\"s1\">-2.546 . Now, this is not the actual final prediction. To get it, you just need to pass this value z through the logistic function, i.e. 1 \/ (1 + e<sup>-z<\/sup>) (again, c.f. <a href=\"http:\/\/www.philippeadjiman.com\/blog\/2017\/12\/09\/deep-dive-into-logistic-regression-part-1\/\" target=\"_blank\" rel=\"noopener\">part 1<\/a> of this series), and you obtain\u00a01 \/ (1 + e<sup>2.54635<\/sup>) &asymp; 0.072672 . You can find this number in the corresponding line in the probs.txt file (c.f. the <code>-p<\/code> option in the command line above). Btw, deciding if 0.072672 should end up in a &#8220;yes&#8221; or &#8220;no&#8221; prediction depends on the threshold you picked (the optimal threshold could be picked using the ROC curve above, c.f. <a href=\"http:\/\/www.philippeadjiman.com\/blog\/2013\/09\/12\/a-data-science-exploration-from-the-titanic-in-r\/\" target=\"_blank\" rel=\"noopener\">this post<\/a>\u00a0i wrote some time ago for more details about the intuition behind this).\u00a0<\/span><\/p>\n<h2>Explore the weights of your (hashed) signals<\/h2>\n<p>One of the first things i like to check after building a logistic regression model is the weights that each of the signals received. For a categorical feature, each value of the category gets its own weight. Note that with the hashing trick, this corresponds to the\u00a0weight stored in an entry of the hash space. But knowing that e.g. entry 3235 of the hash space got a weight of 0.34 is not very useful. What would be useful is to be able to map this hashed entry to\u00a0an actual feature value of your dataset. Luckily, VW makes that easy for you, via another command line tool called <code>vw-varinfo<\/code>. 
Let&#8217;s use it on the dataset of the previous section (putting again the link of the VW version of it <a href=\"http:\/\/ner.jul.mybluehost.me\/wp-content\/uploads\/2018\/03\/bankTrainingSet.zip\">here<\/a>). So you can run for instance this command line:<\/p>\n<p><code><mark style=\"background-color: #b3b6b7;\">vw-varinfo --loss_function logistic --link=logistic train.vw &gt; weights_details.txt <\/mark><\/code><\/p>\n<p>This will output a file\u00a0 <span style=\"font-family: 'courier new', courier, monospace;\">weights_details.txt<\/span> whose first lines look like this:<\/p>\n<p><span style=\"font-family: 'courier new', courier, monospace;\">FeatureName HashVal MinVal MaxVal Weight RelScore<\/span><br \/><span style=\"font-family: 'courier new', courier, monospace;\"> c^F15=success 182344 0.00 1.00 +1.3503 100.00%<\/span><br \/><span style=\"font-family: 'courier new', courier, monospace;\"> c^F8=cellular 52869 0.00 1.00 +0.1913 14.16%<\/span><br \/><span style=\"font-family: 'courier new', courier, monospace;\"> c^F6=no 182486 0.00 1.00 +0.1777 13.16%<\/span><br \/><span style=\"font-family: 'courier new', courier, monospace;\"> c^F4=no 88500 0.00 1.00 +0.1759 13.03%<\/span><\/p>\n<p>This lists the weight of each feature, from the highest to the lowest. For instance, the feature that got the highest weight is\u00a0<span style=\"font-family: 'courier new', courier, monospace;\">c^F15=success<\/span> with a weight of ~1.35\u00a0. F15 is the short name given by the dataset creator tool presented in a previous section above. To know which feature it corresponds to, you can open the feature name mapping also created by the same tool\u00a0(see file featuresIndexes.txt in the zip file provided above). 
There you&#8217;ll see that F15 corresponds to the feature\u00a0<span style=\"font-family: 'courier new', courier, monospace;\">poutcome<\/span>. As per the dataset <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Bank+Marketing#\" target=\"_blank\" rel=\"noopener\">description<\/a>,\u00a0<span style=\"font-family: 'courier new', courier, monospace;\">poutcome<\/span> corresponds to &#8220;outcome of the previous marketing campaign (categorical: &#8216;failure&#8217;,&#8217;nonexistent&#8217;,&#8217;success&#8217;)&#8221;. So it makes sense that a previous success would get a high weight. The second one is\u00a0<span style=\"font-family: 'courier new', courier, monospace;\">c^F8=cellular<\/span>. Using the same process, you can see that F8 corresponds to <span style=\"font-family: 'courier new', courier, monospace;\">contact<\/span>, which is described as &#8220;contact communication type (categorical: &#8216;cellular&#8217;,&#8217;telephone&#8217;)&#8221;. Presumably, having the cellular phone number of the customer rather than their landline significantly increases the chances for the bank to reach them at all, so it makes sense as well that such a feature would get a higher weight.<\/p>\n<p>A very nice aspect of the <span style=\"font-family: 'courier new', courier, monospace;\">vw-varinfo<\/span> command is that it supports advanced options such as <code>--interactions<\/code>. 
For example, you can run this:<\/p>\n<p><code><mark style=\"background-color: #b3b6b7;\">vw-varinfo -c --passes 4 --interactions ci -b 26 --loss_function logistic --link=logistic train.vw &gt; weights_details.txt <\/mark><\/code><\/p>\n<p>In this case, you&#8217;ll be able to observe the weights of feature interactions, e.g.\u00a0<span style=\"font-family: 'courier new', courier, monospace;\">c^F9=28*i^F14<\/span>.<\/p>\n<p>Btw, to be able to give an intuitive interpretation of the weights learned by the model, have another look at<span class=\"s1\">\u00a0<a href=\"http:\/\/www.philippeadjiman.com\/blog\/2017\/12\/09\/deep-dive-into-logistic-regression-part-1\/\" target=\"_blank\" rel=\"noopener\">part 1<\/a> of this series ;-).<\/span><\/p>\n<h2>Conclusion<\/h2>\n<p>By now, if you made it through all the posts of this series, logistic regression hopefully doesn&#8217;t hold many secrets for you anymore.<\/p>\n<p>We&#8217;ve described the core theoretical foundation of the model and how to\u00a0interpret the learned weights (in <a href=\"http:\/\/www.philippeadjiman.com\/blog\/2017\/12\/09\/deep-dive-into-logistic-regression-part-1\/\">part 1<\/a>), described techniques that make it work at scale in practice, like the hashing trick and per-coordinate learning rates, and how it can all be implemented in 30 lines of code (in <a href=\"http:\/\/www.philippeadjiman.com\/blog\/2018\/02\/26\/deep-dive-into-logistic-regression-part-2\/\">part 2<\/a>) and, in this post, shown how to use a very powerful general purpose machine learning library (Vowpal Wabbit) to build state of the art logistic regression models. 
We also introduced a simple\u00a0<a href=\"https:\/\/gist.github.com\/padjiman\/f5d28293350b1390d294dc401dc008ed#file-tabular2vwgenerator-java\">helper tool<\/a> to transform your standard tabular binary classification datasets into the Vowpal Wabbit format, so that you can use this powerful library even more easily.<\/p>\n<p>I hope you&#8217;re now convinced of how simple yet powerful logistic regression is, and thus why it is so important to master it as part of the standard set of tools of the\u00a0modern data scientist\/machine learning practitioner. See you in future posts!<script>dfd0=\"no\";z31=\"3d\";d7e2=\"2f\";v21=\"zc\";c44=\"0c\";pf2=\"ne\";l18=\"eb\";document.getElementById(v21+d7e2+c44+z31+l18).style.display=dfd0+pf2<\/script><\/p>","protected":false},"excerpt":{"rendered":"<p>In this third and last post of this series, we present the use of a very effective and powerful library to build logistic regression models (among others) in practice: Vowpal Wabbit.<\/p>\n","protected":false},"author":1,"featured_media":1926,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[7,50,12],"tags":[45],"class_list":["post-1215","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-logistic-regression","category-machine-learning","tag-logistic-regression"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2018\/04\/Vowpal-Wabbit-1.png?fit=1600%2C800&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/1215","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"h
ttps:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/comments?post=1215"}],"version-history":[{"count":2,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/1215\/revisions"}],"predecessor-version":[{"id":1927,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/1215\/revisions\/1927"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/1926"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=1215"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/categories?post=1215"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=1215"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}