## hd_binning
High Definition Binning	{#HD_binning}
=====================


Binning (or discretization) of variables is a well-established practice in building credit scorecards. It takes the raw values of a variable, e.g. income, and cuts them into bins (discrete ranges) such as 2000-3000 and 3000-4000. Typically the Good/Bad odds trend upward as income rises.
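As a small illustration of the idea, the sketch below cuts income into fixed ranges and computes the Good/Bad odds per bin. The bin edges, the records and the names `bin_label` and `good_bad_odds` are all made up for this example; a real scorecard would use its own data and cut points.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bin edges: [2000, 3000), [3000, 4000), [4000, 5000)
EDGES = [2000, 3000, 4000, 5000]

def bin_label(income, edges=EDGES):
    """Return the 'lo-hi' label of the bin containing income, or None if outside."""
    i = bisect_right(edges, income) - 1
    if i < 0 or i >= len(edges) - 1:
        return None
    return f"{edges[i]}-{edges[i + 1]}"

def good_bad_odds(records, edges=EDGES):
    """Map each bin label to (number of goods) / (number of bads)."""
    counts = defaultdict(lambda: [0, 0])  # label -> [goods, bads]
    for income, is_good in records:
        label = bin_label(income, edges)
        if label is not None:
            counts[label][0 if is_good else 1] += 1
    # Bins without any bads are skipped to avoid dividing by zero.
    return {label: g / b for label, (g, b) in counts.items() if b > 0}

# Made-up (income, is_good) records, for illustration only.
records = [
    (2100, True), (2500, False), (2900, True), (2700, False),
    (3200, True), (3500, True), (3900, False),
    (4100, True), (4500, True), (4800, True), (4900, False),
]
odds = good_bad_odds(records)
```

With this toy data the odds rise from one bin to the next, which is the upward trend described above.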

In this blog post I would like to explain a novel approach to binning that can produce very fine bins.

Automatic Binary Binning Algorithm (ABBA)
---------

To understand how High Definition Binning works, we first need to accept that there exist algorithms that can bin raw factors automatically. One such algorithm is the Automatic Binary Binning Algorithm (ABBA). It can bin a variable subject to a required trend in the Good/Bad (GB) odds, e.g. an upward trend in GB odds across the bins.


Bootstrap
---------
High definition binning is basically an application of bootstrapping with ABBA. Let's define what bootstrapping is: treat your dataset as the whole universe. If you have applied bootstrapping before, you might be under the impression that bootstrapping is about sampling. Actually, it is not. Let me explain. Suppose your dataset consists of only 3 records: call them A, B, and C. If those 3 records were your whole universe, then there are only $$3^3 = 27$$ ways to draw an ordered set of 3 records from it, allowing repeats:

Sample No. | Sample
---------- | -------
1          | A, A, A
2          | A, A, B
3          | A, A, C
4          | A, B, A
...        | ...
26         | C, C, B
27         | C, C, C

If you compute a summary statistic, such as the average income of the sample, each of the 27 possible samples gives you a potentially different number. These 27 numbers form a distribution, which you can analyse and use to make inferences.
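The enumeration is small enough to write out in full. The sketch below lists all 27 ordered samples and tabulates the mean income of each; the income values assigned to A, B and C are hypothetical and only there to make the statistic concrete.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical incomes for the three records A, B and C.
income = {"A": 2000, "B": 3000, "C": 4000}

# Every ordered way of drawing 3 records with replacement: 3^3 = 27 samples.
samples = list(product(income, repeat=3))

# The summary statistic (here the mean income) for each sample.
# Fraction keeps the means exact so equal values are counted together.
means = [Fraction(sum(income[r] for r in s), 3) for s in samples]

# The exact bootstrap distribution: how often each value of the statistic occurs.
distribution = Counter(means)
```

Here `samples` follows the same order as the table, from `('A', 'A', 'A')` to `('C', 'C', 'C')`, and `distribution` is the full bootstrap distribution of the mean, obtained without any random sampling.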

Most datasets contain more than 3 rows; in credit scoring it is common to have tens of millions of rows to work with. This is where sampling comes in. When your dataset is large, it is infeasible to enumerate all possible samples, so you instead sample with replacement a large number of times to obtain bootstrapped samples.
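A minimal sketch of that sampling step, using only the Python standard library. The function name `bootstrap_means`, the number of resamples and the income values are assumptions made for illustration, not part of any particular package.

```python
import random
import statistics

def bootstrap_means(values, n_resamples=1000, seed=0):
    """Approximate the bootstrap distribution of the mean by resampling."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    n = len(values)
    return [
        statistics.fmean(rng.choices(values, k=n))  # draw n values with replacement
        for _ in range(n_resamples)
    ]

# Made-up incomes, for illustration only.
incomes = [2000, 2500, 3000, 3200, 4100, 4800, 5200, 6000]
means = sorted(bootstrap_means(incomes))

# A rough 95% percentile interval read off the sorted resampled means.
lo, hi = means[25], means[974]
```

Each resample has the same size as the original data, so the sampled means approximate the distribution that full enumeration would give if it were feasible.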

# High definition binning
In credit scoring, each binning is basically a step function that maps a raw value to a weight of evidence (WOE). It would look something like this:

http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous
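In code, such a step function is just a lookup of the bin that contains the raw value. The sketch below assumes hypothetical cut points and WOE values (the numbers are invented, not estimated from data) and uses half-open bins that include their lower edge.

```python
from bisect import bisect_right

# Hypothetical cut points and per-bin WOE values, for illustration only.
# Bins: (-inf, 2000), [2000, 3000), [3000, 4000), [4000, +inf)
CUTS = [2000, 3000, 4000]
WOES = [-0.8, -0.2, 0.3, 0.9]

def woe(value, cuts=CUTS, woes=WOES):
    """Step function: return the WOE of the bin that contains value."""
    return woes[bisect_right(cuts, value)]
```

For example, `woe(2500)` and `woe(2999)` return the same value because both incomes fall in the same bin, and the output only jumps when a cut point is crossed.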

