Wenming Ye wenming

  • Amazon Web Services
  • Redmond, WA
@wenming
wenming / gutenbergcrawl
Last active December 15, 2015 14:58
Gutenberg crawler that copies English-only documents
#!/usr/bin/python
# version 0.1 Wenming Ye 2/25/2012
# Extract English, text-only content from the 2010 Gutenberg DVD.
# If you have questions, please contact me for the latest version.
# Feel free to modify the scripts to your needs.
# STEP 1: Run this in the Cygwin environment. If you don't want to use Cygwin, you can modify the "cp" command embedded in the script.
# This file parses the HTML index pages (TITLES) and finds English-language books and their ZIP resource URLs.
# Run this in the Gutenberg main INDEXES dir, "www.gutenberg.org/INDEXES".
# Removes PDF, HTML, images, and non-English items. All the zip files will be copied into INDEXES/zips.
# STEP 2: Then you can extract all the zip files by running >>>>find ./ -name "*.zip" -exec unzip -o {} \;<<<<
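For readers without `find` and `unzip` available, STEP 2 could be approximated in Python. This is a minimal sketch, not part of the original script: the function name `extract_all_zips` and its `dest` parameter are hypothetical, and it assumes that unpacking every archive into one directory (as `unzip -o` does when run from that directory) is the desired behavior.

```python
import zipfile
from pathlib import Path


def extract_all_zips(root, dest=None):
    """Extract every .zip found under root, overwriting existing files.

    Each archive is unpacked into dest if given, otherwise into root,
    which roughly matches running `unzip -o` from the root directory.
    Returns the list of archives that could not be read as zip files.
    """
    root = Path(root)
    target = Path(dest) if dest is not None else root
    failed = []
    # Materialize the list first so files created during extraction
    # are not picked up by the directory walk.
    for archive in sorted(root.rglob("*.zip")):
        try:
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(target)
        except zipfile.BadZipFile:
            failed.append(archive)
    return failed
```

Calling `extract_all_zips("zips")` from the INDEXES directory would unpack into `zips`; the returned list makes it easy to see which archives were skipped.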
wenming / gist:5174396
Created March 16, 2013 01:02
blob python
from azure.storage import *
import base64
import os
def upload(blob_service, container_name, blob_name, file_path):
    blob_service.create_container(container_name, None, None, False)
    blob_service.put_blob(container_name, blob_name, '', 'BlockBlob')
    chunk_size = 65536
    block_ids = []
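The preview of `upload` stops right after `chunk_size` and `block_ids` are set up, so the rest of the function is not shown. As an illustration only, the chunking step that such an upload typically needs might look like the sketch below. The helper name `iter_blocks` is hypothetical, and the actual storage calls are deliberately left out because the original ones are not visible. The block IDs are base64-encoded and of equal length, which is what block blobs require.

```python
import base64


def iter_blocks(file_path, chunk_size=65536):
    """Yield (block_id, data) pairs for a file read in chunk_size pieces.

    Block IDs are base64-encoded, zero-padded counters, so every ID has
    the same length and sorts in upload order.
    """
    with open(file_path, "rb") as f:
        index = 0
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            raw_id = "{:08d}".format(index).encode("ascii")
            yield base64.b64encode(raw_id).decode("ascii"), data
            index += 1
```

A caller would upload each `data` chunk under its `block_id`, append the ID to `block_ids`, and commit the list once the loop finishes.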
REM install the redist so that openssl will not complain w/vc9 -- could use %ROLEROOT% here
vcredist_x86.exe /q
REM set a path for openssl and freetds
SET PATH=%PATH%;E:\lib\freedts;E:\lib\openssl
REM Download directly from rubyinstaller.org
E:
powershell -c "(new-object System.Net.WebClient).DownloadFile('http://rubyforge.org/frs/download.php/75894/railsinstaller-2.1.0.exe', 'railsinstaller.exe')"
REM install silently
railsinstaller.exe /verysilent /dir="%RUBY_PATH%" /tasks="assocfiles,modpath"
REM remove any tiny tds copies
wenming / gist:4237117
Created December 7, 2012 22:38
list for webcamp repos
Microsoft-Web/PRESENTATION-Keynote
Microsoft-Web/PRESENTATION-BuildingServiceLayerWithASPNETWebAPI
Microsoft-Web/DEMO-BuildingForTheMobileWeb
Microsoft-Web/DEMO-BuildingServiceLayerWithASPNETWebAPI
Microsoft-Web/PRESENTATION-BuildingForTheMobileWeb
Microsoft-Web/PRESENTATION-BuildingSocialWebApps
Microsoft-Web/DEMO-BuildingSocialWebApps
Microsoft-Web/PRESENTATION-UsingCloudApplicationServices
Microsoft-Web/PRESENTATION-HTML5andjQuery
Microsoft-Web/PRESENTATION-RealtimeCommunicationsWithSignalR
wenming / millionblobdownload.txt
Created August 13, 2012 01:24
downloading millions of files from blob storage fast
Raw notes for downloading 6.6+ million files from blob storage within hours using a few simple tools on a single machine.
1. Get a list of files from blob storage. A few lines of C# code will do.
//In app config.
<configuration>
<appSettings>
<add key="StorageConnectionString"
value="DefaultEndpointsProtocol=https;AccountName=storagename;AccountKey=yourkey" />
</appSettings>
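The connection string above is a semicolon-separated list of `Key=Value` settings. As a small illustration of that format (in Python rather than the C# the notes use, and with a hypothetical helper name), it could be split like this:

```python
def parse_connection_string(value):
    """Split 'Key=Value;Key=Value' settings into a dict.

    Only the first '=' in each segment separates key from value, because
    base64 account keys can end in '=' padding.
    """
    settings = {}
    for segment in value.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing ';'
        key, sep, val = segment.partition("=")
        if not sep:
            raise ValueError("missing '=' in segment: %r" % segment)
        settings[key.strip()] = val.strip()
    return settings
```

For the placeholder string shown in the config, this yields the keys `DefaultEndpointsProtocol`, `AccountName`, and `AccountKey`.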
wenming / parseJson.py
Created July 25, 2012 22:15
parse json
#!/usr/bin/python
import os
import sys
import json
import pprint
file = open("twitter_stream_seq2.txt", 'r')
lines = file.readlines()
i = 0
str = ""
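The preview ends before the parsing logic, so the original approach is not visible. One common way to handle a captured stream file is sketched below, under the assumption that the file holds one JSON object per line; if the real file splits objects across lines, this would not apply. The function name `parse_json_lines` is hypothetical.

```python
import json


def parse_json_lines(lines):
    """Parse an iterable of text lines, one JSON object per line.

    Blank lines are skipped. Returns (records, errors), where errors is
    a list of (line_number, message) for lines that failed to parse.
    """
    records, errors = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            errors.append((number, str(exc)))
    return records, errors
```

Passing an open file object (for example the one opened on `twitter_stream_seq2.txt`) works directly, since iterating a file yields its lines without loading everything into memory.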
wenming / hadooponazure
Created June 16, 2012 14:41
Resources for Hadoop on Windows Azure
Hadooponazure.com is strictly a private CTP for Microsoft's Hadoop distribution. It supports Hive, Pig, a JavaScript console, and a web portal. You can also connect to the actual clusters over Terminal Services as needed. The training kit includes a deck and a number of tutorials.
You should also be able to find content on windowsazure.com:
http://www.windowsazure.com/en-us/develop/net/scenarios/big-data/
http://www.windowsazure.com/en-us/develop/net/how-to-guides/hadoop/
I recommend going through at least one of these tutorials:
http://www.windowsazure.com/en-us/develop/net/tutorials/hadoop-marketplace/
You may also want to look at this deck, in addition to the one included in the training kit: http://view.officeapps.live.com/op/view.aspx?src=http%3a%2f%2fvideo.ch9.ms%2fteched%2f2012%2fna%2fAZR325.pptx
wenming / gist:2941492
Created June 16, 2012 14:32
Resources for Azure scheduler
Definition of HPC:
High Performance Computing (HPC) is the use of servers, clusters, and supercomputers – plus associated software, tools, components, storage, and services – for scientific, engineering, or analytical tasks that are particularly intensive in computation, memory usage, or data management. HPC is used by scientists and engineers both in research and in production across industry, government, and academia. Within industry, HPC can frequently be distinguished from general business computing in that companies generally will use HPC applications to gain advantage in their core endeavors – e.g., finding oil, designing automobile parts, or protecting clients’ investments – as opposed to non-core endeavors such as payroll management or resource planning.
The Azure HPC Scheduler is a great way to run batch workloads, including but not limited to HPC.
The Azure HPC Scheduler includes three programming models:
MPI, SOA, and parametric sweep.
MPI is a traditional HPC programming model which you can look up on
wenming / Twitter (json format).js
Created June 8, 2012 02:00 — forked from gnip/Twitter (json format).js
Twitter Sample Payload, JSON format
{
"coordinates": null,
"created_at": "Thu Oct 21 16:02:46 +0000 2010",
"favorited": false,
"truncated": false,
"id_str": "28039652140",
"entities": {
"urls": [
{
"expanded_url": null,
      program convertsongdat
c
      integer nrows, ncols, tnnz
      integer nrow(384546), matrix(384546)
      integer rowe, matrixe, rowi, coli, ncol, nonzero
c
      nrows=384546
      ncols=1019318
      tnnz=48373586