
Ref: http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
Apache Hadoop: Best Practices and Anti-Patterns
Wed August 18, 2010 (Updated)
by Arun C Murthy
Apache Hadoop is a software framework for building large-scale, shared storage and computing infrastructures. Hadoop clusters are used for a variety of research and development projects, and for a growing number of production processes at Yahoo!, eBay, Facebook, LinkedIn, Twitter, and other companies in the industry. It is a key component in several business-critical endeavors and represents a very significant investment. Thus, appropriate usage of the clusters and Hadoop is critical in ensuring that we reap the best possible return on this investment.
This blog post represents a compendium of best practices for applications running on Apache Hadoop. In fact, we introduce the notion of a Grid Pattern which, similar to a Design Pattern, represents a general reusable solution for app
@ivanliu
ivanliu / gist:ec227557822b366e9534e85474c46e90
Last active November 28, 2016 00:35
Some links to cool stuff
1. Making a data science blog
https://www.dataquest.io/blog/how-to-setup-a-data-science-blog/
2. 28 Jupyter Notebook tips, tricks and shortcuts
https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
3. Docker: Data Science Environment with Jupyter (DONE)
https://www.dataquest.io/blog/docker-data-science/
4. Unsupervised Learning: Clustering
ivanliu / data_scraping.prd
Last active December 18, 2016 08:13
Data Scraping
= System Requirements =
+ Financial Data Store
- Be able to store raw web pages (unstructured data) as well as extracted data (structured data)
- Use MySQL for now
+ Crawler
- Crawl selected website and store raw web pages;
- Extract fields of interest from the web page;
- Use Python Scrapy
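The two storage requirements above (raw pages kept alongside extracted fields) can be sketched with the standard library alone. This is only an illustration under stated assumptions: it uses SQLite in place of the MySQL named in the notes, parses with `html.parser` rather than Scrapy, and the table names (`raw_pages`, `extracted_fields`) and helper names (`create_store`, `store_page`, `TitleParser`) are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def create_store(conn):
    # Hypothetical schema: one table for raw (unstructured) pages,
    # one for the structured fields extracted from them.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS raw_pages (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL,
            html TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS extracted_fields (
            page_id INTEGER NOT NULL REFERENCES raw_pages(id),
            name TEXT NOT NULL,
            value TEXT
        );
    """)


def store_page(conn, url, html):
    # Keep the raw page first so it can be re-parsed later if the
    # extraction rules change.
    cur = conn.execute(
        "INSERT INTO raw_pages (url, html) VALUES (?, ?)", (url, html)
    )
    page_id = cur.lastrowid

    # Extract one field of interest (the page title) as an example.
    parser = TitleParser()
    parser.feed(html)
    conn.execute(
        "INSERT INTO extracted_fields (page_id, name, value) VALUES (?, ?, ?)",
        (page_id, "title", parser.title.strip()),
    )
    conn.commit()
    return page_id
```

Calling `create_store(sqlite3.connect(":memory:"))` and then `store_page(...)` for each crawled page is enough to try it out; a Scrapy item pipeline could presumably take over the `store_page` role once the crawler exists.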
ivanliu / chatbot list
Created January 23, 2017 19:34
chatbot
https://www.quora.com/What-is-the-best-API-to-create-a-chatbot-in
https://developer.pandorabots.com/
https://playground.pandorabots.com/en/
https://api.ai/
http://chatscript.sourceforge.net/
https://messengerplatform.fb.com/
http://rebot.me/page/about
ivanliu / all_pdf.howto
Last active February 13, 2017 08:08
All about PDF
1) Introduction to PDF and good summary for PDF extraction
https://how-to.usopendata.org/en/latest/The-Basics-of-Open-Data/Working-with-PDFs/
2) Zamzar API
Test
100 conversions remaining this month
100 test conversions remaining this month
springforward
24c0966e31d200ef3981dd22def1e7811061bcb4
ivanliu / website_scrapy_options
Last active January 28, 2018 07:11
How to scrape a website
1. Spynner
https://github.com/makinacorpus/spynner
a) Install libpng
http://ethan.tira-thompson.com/Mac_OS_X_Ports.html
b)
2. Mechanize
http://www.pythonforbeginners.com/mechanize/browsing-in-python-with-mechanize/
http://www.pythonforbeginners.com/cheatsheet/python-mechanize-cheat-sheet
from flask import Flask, render_template

app = Flask(__name__)


@app.route('/')
@app.route('/index')
def index(chartID='chart_ID', chart_type='bar', chart_height=350):
    # Chart options and series data for the page.
    chart = {"renderTo": chartID, "type": chart_type, "height": chart_height}
    series = [{"name": 'Label1', "data": [1, 2, 3]}, {"name": 'Label2', "data": [4, 5, 6]}]
    title = {"text": 'My Title'}
ivanliu / compitetion
Created June 5, 2017 07:30
competitor analysis
1. The Future of Investing? AI-Run Hedge Funds
https://futurism.com/the-future-of-investing-ai-run-hedge-funds/
aidyia, sentient, rebellionresearch
ivanliu / py_pkg.howto
Created July 8, 2017 19:29
Packaging python
http://jtushman.github.io/blog/2013/06/17/sharing-code-across-applications-with-python/#3
https://hynek.me/articles/sharing-your-labor-of-love-pypi-quick-and-dirty/
http://python-notes.curiousefficiency.org/en/latest/index.html
ivanliu / cool_presentation
Last active July 8, 2017 22:07
Impressive presentation/demo
1) A live python demo
http://pyvideo.org/pycon-us-2015/python-concurrency-from-the-ground-up-live.html
code -> https://github.com/ivanliu/concurrencylive