#!/usr/bin/env python2.7
import json
import random
from mrjob.job import MRJob

class TestStdIn(MRJob):
    # The original gist is truncated right after this decorator; a minimal
    # JSON-parsing helper body is assumed here so the class runs as-is.
    @staticmethod
    def parse_line(line):
        return json.loads(line)

if __name__ == '__main__':
    TestStdIn.run()
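The job can be exercised locally against stdin with mrjob's default inline runner, e.g. python test_stdin.py < input.txt (the filename here is illustrative).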

Parallelism in ES and Hadoop/Spark

When reading, 1 ES (primary) shard corresponds to 1 Spark partition.

Reading from ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/arch.html#arch-reading. Beware of increasing the number of shards on ES for performance reasons:

A common concern (read optimization) for improving performance is to increase the number of shards and thus increase the number of tasks on the Hadoop side. Unless such gains are demonstrated through benchmarks, we recommend against such a measure since in most cases, an Elasticsearch shard can easily handle data streaming to a Hadoop or Spark task.

Writing to ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/arch.html#arch-writing. Write performance can be increased by having more partitions:

elasticsearch-hadoop detects the number of (primary) shards where the write will occur and distributes the writes between these. The more splits/partitions available, the more mappers/reducers can write data in parallel to Elasticsearch.
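The shard-to-partition mapping is easy to check from Spark itself. A minimal PySpark sketch, assuming the elasticsearch-spark connector is on the classpath; the node address and the index names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-parallelism").getOrCreate()

# Reading: the connector creates one Spark partition per primary shard.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")   # placeholder address
      .load("myindex"))                       # placeholder index name
print(df.rdd.getNumPartitions())              # == number of primary shards

# Writing: more partitions means more tasks streaming to ES in parallel.
(df.repartition(16)
   .write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost:9200")
   .mode("append")
   .save("myindex-copy"))                     # placeholder target index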

dfdeshom / gist:6277295
Last active December 21, 2015 08:19
SolrCloud: The good and the bad

We're coming up on a year since SolrCloud (Solr 4.0) was released. The company I work for recently switched to Solr 4.3, and the overall impression has been good, although there have been some growing pains. What follows are my impressions of what I've liked and not liked so far about SolrCloud.

The Bad

You can still run Solr in "non-cloud" mode. This means there are two code paths in the lucene-solr repo. It also means that support questions can get a little more complicated. There are some issues that come up because of this separation:

  • Configuration is somewhat in flux. The solr.xml file is scheduled for a major change in Solr 5 (http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond) and might disappear completely. schema.xml and solrconfig.xml now live in ZooKeeper.

  • There seems to be some confusion between the cores API and the collections API. The collections API is a nice superset of the cores API, but some people think the two can be used interchangeably; they can't (see the sketch below).
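To make the distinction concrete, here is a hedged sketch against the stock Solr HTTP endpoints; the host, collection, and core names are placeholders, and any HTTP client would do:

import requests

SOLR = "http://localhost:8983/solr"  # placeholder node address

# Collections API: cluster-aware. Creates shards and replicas across the
# whole SolrCloud cluster and records the layout in ZooKeeper.
requests.get(SOLR + "/admin/collections", params={
    "action": "CREATE", "name": "mycollection",
    "numShards": 2, "replicationFactor": 1,
})

# Cores API: node-local. Creates a core on this one node only; in cloud
# mode the rest of the cluster won't know how to route documents to it.
requests.get(SOLR + "/admin/cores", params={
    "action": "CREATE", "name": "mycore", "instanceDir": "mycore",
})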

import json
import redis
import tldextract

def get_spider_info_for_url(url):
    h = redis.Redis()
    extracted = tldextract.extract(url)
    # NB: newer tldextract versions call this attribute `suffix`, not `tld`.
    wholedomain = ".".join([extracted.domain, extracted.tld])
    wholesubdomain = ".".join([extracted.subdomain, extracted.domain, extracted.tld])
    # Look for the full subdomain first, then fall back to the bare domain
    # (the fallback and return are assumed; the original gist is truncated here).
    info = h.hget('spider_info', wholesubdomain)
    if info is None:
        info = h.hget('spider_info', wholedomain)
    return info
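Hypothetical usage, assuming a local Redis whose spider_info hash has already been populated:

print(get_spider_info_for_url("http://blog.example.co.uk/some/post"))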
dfdeshom / gist:654542
Created October 29, 2010 22:11
diff (curl - superfeedr)
8,41d7
< <image> <url>http://www.examiner.com/sites/all/themes/base/images/logo.gif</url>
< <title>examiner.com</title>
< <link>http://www.examiner.com</link>
< <description>Examiner.com is the insider source for everything local.</description>
< <width>126</width>
< <height>29</height>
< </image>
< <item>
< <title>Paranormal Activity 2: If it&#039;s not a burglar, it must be a ghost.</title>
dfdeshom / gist:654540
Created October 29, 2010 22:09
curl crawl
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.examiner.com" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Examiner San Jose Edition Articles</title>
<link>http://www.examiner.com/rss/san-jose</link>
<description>Latest News and Articles from Examiner.com</description>
<language>en</language>
<image> <url>http://www.examiner.com/sites/all/themes/base/images/logo.gif</url>
<title>examiner.com</title>
<link>http://www.examiner.com</link>
dfdeshom / gist:654534
Created October 29, 2010 22:07
superfeedr crawl
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.examiner.com" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Examiner San Jose Edition Articles</title>
<link>http://www.examiner.com/rss/san-jose</link>
<description>Latest News and Articles from Examiner.com</description>
<language>en</language>
<item>
<title>De Anza College women&#039;s soccer team defeats Cabrillo College 2-0</title>
<link>http://www.examiner.com/sports-photography-in-san-jose/de-anza-college-women-s-soccer-team-defeats-cabrillo-college-2-0</link>
$ curl -vv http://www.examiner.com/rss/recent/santa-ana > /dev/null
* About to connect() to www.examiner.com port 80 (#0)
* Trying 67.220.220.55... connected
* Connected to www.examiner.com (67.220.220.55) port 80 (#0)
> GET /rss/recent/santa-ana HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
> Host: www.examiner.com
> Accept: */*
>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
<script type="text/javascript" src="http://static.parse.ly/p.js?apikey=examiner.com"></script>
<div id="parsely_container" style="display: none">
<a href="http://parse.ly/p3">Personalization by the Parse.ly Publisher Platform (P3)</a>
</div>
diff --git a/celerymonitor/handlers/api.py b/celerymonitor/handlers/api.py
index 7afa6d6..e86c614 100644
--- a/celerymonitor/handlers/api.py
+++ b/celerymonitor/handlers/api.py
@@ -12,6 +12,13 @@ def JSON(fun):
@wraps(fun)
def _write_json(self, *args, **kwargs):
content = fun(self, *args, **kwargs)
+ def _any(data):
+ if type(data) == type({}):
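The hunk above is cut off. As a rough illustration only, here is a hedged sketch of the pattern the patch appears to introduce: a JSON decorator whose nested _any helper recursively converts a handler's return value before serialization. The conversion rules and the Tornado-style set_header/write calls are assumptions, not the actual patch:

import json
from functools import wraps

def JSON(fun):
    """Write a handler's return value out as a JSON response."""
    @wraps(fun)
    def _write_json(self, *args, **kwargs):
        content = fun(self, *args, **kwargs)

        def _any(data):
            # Recursively normalize containers; coerce everything else
            # to str so json.dumps never chokes (rules assumed).
            if isinstance(data, dict):
                return dict((k, _any(v)) for k, v in data.items())
            if isinstance(data, (list, tuple)):
                return [_any(item) for item in data]
            if isinstance(data, (int, float, bool)) or data is None:
                return data
            return str(data)

        self.set_header("Content-Type", "application/json")
        self.write(json.dumps(_any(content)))
    return _write_json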