@xiaodaigh
Created April 10, 2014 02:59
hd_binning
High Definition Binning {#HD_binning}
=====================
Binning (or discretization) of variables is a well-established practice in building credit scorecards. It involves taking raw values, e.g. income, and cutting them into bins (discrete ranges) such as 2000-3000 and 3000-4000. Typically we would see an upward trend in Good/Bad (GB) odds as income levels go up.
In this blog post I would like to explain a novel approach to binning that can produce very fine bins.
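To make the idea concrete, here is a minimal sketch in Python that cuts a handful of incomes into fixed ranges and computes the GB odds per bin. The data, the cut points and the helper name `gb_odds_by_bin` are all made up for illustration; they are not taken from a real scorecard.

```python
from bisect import bisect_right
from collections import defaultdict

def gb_odds_by_bin(values, is_good, cut_points):
    """Return {bin_index: goods / bads} for bins defined by sorted cut_points.

    Bin i covers cut_points[i-1] <= value < cut_points[i]; bin 0 is
    everything below the first cut point.
    """
    goods = defaultdict(int)
    bads = defaultdict(int)
    for value, good in zip(values, is_good):
        b = bisect_right(cut_points, value)
        if good:
            goods[b] += 1
        else:
            bads[b] += 1
    # Bins with no bads have undefined odds and are reported as None.
    return {b: (goods[b] / bads[b] if bads[b] else None)
            for b in sorted(set(goods) | set(bads))}

# Made-up example data: income and whether the account turned out good.
incomes = [2100, 2500, 2900, 3200, 3600, 3900, 3950]
is_good = [True, False, False, True, True, False, True]
odds = gb_odds_by_bin(incomes, is_good, cut_points=[2000, 3000, 4000])
print(odds)  # {1: 0.5, 2: 3.0}
```

In this toy data the 3000-4000 bin has higher odds than the 2000-3000 bin, which is the kind of upward trend described above.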
Automatic Binary Binning Algorithm (ABBA)
---------
To understand how High Definition Binning can be done, we first need to accept that there exist algorithms that can automatically bin the raw factors. One such algorithm is the Automatic Binary Binning Algorithm (ABBA). It can bin a variable subject to certain GB odds trends being satisfied, e.g. an upward trend in GB odds after binning.
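This post does not spell out how ABBA works internally, so the snippet below makes no attempt to reproduce it. It only illustrates the kind of constraint mentioned above: given the GB odds of the bins in order, check whether they trend upward. The function name `has_upward_trend` is a hypothetical helper, and treating ties as acceptable is an assumption.

```python
def has_upward_trend(odds):
    """True if each bin's odds are at least as large as the previous bin's."""
    return all(a <= b for a, b in zip(odds, odds[1:]))

print(has_upward_trend([0.5, 1.2, 3.0]))  # True
print(has_upward_trend([0.5, 3.0, 1.2]))  # False
```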
Bootstrap
---------
High definition binning is basically an application of bootstrapping with ABBA. Let's define what bootstrapping is: treat your dataset as the whole universe. If you have applied bootstrapping before, you might be under the impression that bootstrapping is about sampling. At its core it is not. Let me explain. Suppose your dataset consists of only 3 records: call them A, B, and C. If those 3 records were your whole universe, then there are only $$3^3 = 27$$ ways to draw an ordered set of 3 records from it, with repeats allowed:
Sample No. | Sample
--------- | -----
1 | A, A, A
2 | A, A, B
3 | A, A, C
4 | A, B, A
... | ...
26 | C, C, B
27 | C, C, C
If you compute a summary statistic, such as the average income of the sample, then each of the 27 possible samples gives you a potentially different number. These 27 numbers form a distribution, and you can analyse this distribution and make inferences from it.
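The enumeration can be written out directly. The sketch below assigns hypothetical incomes to A, B and C (the values are made up), lists all 27 ordered samples, and tallies the mean income of each one.

```python
from collections import Counter
from itertools import product

# Hypothetical incomes for the three records A, B and C.
income = {"A": 2000, "B": 3000, "C": 4000}

# Every ordered way to draw 3 records with replacement: 3^3 = 27 samples.
samples = list(product(income, repeat=3))

# The statistic of interest (here the mean income) for each sample.
sample_means = [sum(income[r] for r in s) / len(s) for s in samples]

# The distribution of the statistic: how often each value occurs.
distribution = Counter(sample_means)

print(len(samples))  # 27
for value in sorted(distribution):
    print(round(value, 2), distribution[value])
```

`product` yields the samples in the same order as the table, so `samples[3]` is `('A', 'B', 'A')` and `samples[25]` is `('C', 'C', 'B')`.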
Now, most datasets contain more than 3 rows; in credit scoring it is common to have tens of millions of rows of data to work with. This is where sampling comes in. When your dataset is large, it is impossible to enumerate every possible sample, so instead you sample with replacement a large number of times to obtain bootstrapped samples.
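A minimal sketch of that sampling step, using only the Python standard library, might look like this. The income values, the number of resamples and the helper name `bootstrap_means` are illustrative assumptions, and the percentile interval at the end is just one common way to summarise the resulting distribution.

```python
import random

def bootstrap_means(data, n_resamples, seed=None):
    """Mean of each of n_resamples samples drawn with replacement from data."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Each resample has the same size as the data, drawn with replacement.
        resample = rng.choices(data, k=n)
        means.append(sum(resample) / n)
    return means

incomes = [2000, 2500, 3100, 3600, 4200, 5000, 7500]  # made-up values
means = bootstrap_means(incomes, n_resamples=1000, seed=42)
means.sort()

# Rough 95% percentile interval for the mean income.
lower, upper = means[24], means[974]
print(lower, upper)
```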
High definition binning
---------
In credit scoring, each binning is basically a step function that maps a raw value to a weight of evidence (WOE). So it would look something like this:
http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous
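As a sketch of such a step function, the Python below maps an income to the WOE of the bin it falls into. It assumes one common convention, WOE = ln(share of all goods in the bin / share of all bads in the bin); the counts, cut points and the names `woe` and `StepFunction` are made up for illustration.

```python
from bisect import bisect_right
from math import log

def woe(goods_in_bin, bads_in_bin, total_goods, total_bads):
    """Weight of evidence of one bin: ln(share of goods / share of bads)."""
    return log((goods_in_bin / total_goods) / (bads_in_bin / total_bads))

class StepFunction:
    """Piecewise-constant map from a raw value to the WOE of its bin."""

    def __init__(self, cut_points, woes):
        # One WOE per interval, so there is one more WOE than cut points.
        if len(woes) != len(cut_points) + 1:
            raise ValueError("need exactly one WOE per bin")
        self.cut_points = list(cut_points)
        self.woes = list(woes)

    def __call__(self, value):
        # A value equal to a cut point belongs to the bin on its right.
        return self.woes[bisect_right(self.cut_points, value)]

# Made-up (goods, bads) counts for the bins <3000, 3000-4000 and >=4000.
counts = [(100, 50), (200, 40), (300, 10)]
total_goods = sum(g for g, _ in counts)
total_bads = sum(b for _, b in counts)

income_to_woe = StepFunction(
    cut_points=[3000, 4000],
    woes=[woe(g, b, total_goods, total_bads) for g, b in counts],
)
print(income_to_woe(2500), income_to_woe(3000), income_to_woe(9000))
```

Every income inside the same bin gets the same WOE, and the value only jumps at a cut point, which is what makes it a step function.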