#!/usr/bin/env python2.7
import json
import random
from mrjob.job import MRJob

class TestStdIn(MRJob):
    # The original gist is truncated right after this decorator; a minimal
    # JSON-parsing helper body is assumed here so the class runs as-is.
    @staticmethod
    def parse_line(line):
        return json.loads(line)

if __name__ == '__main__':
    TestStdIn.run()
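The job can be exercised locally against stdin with mrjob's default inline runner, e.g. python test_stdin.py < input.txt (the filename here is illustrative).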

Parallelism in ES and Hadoop/Spark

When reading, 1 ES (primary) shard corresponds to 1 Spark partition.

Reading from ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/arch.html#arch-reading. Beware of increasing the number of shards on ES for performance reasons:

A common concern (read optimization) for improving performance is to increase the number of shards and thus increase the number of tasks on the Hadoop side. Unless such gains are demonstrated through benchmarks, we recommend against such a measure since in most cases, an Elasticsearch shard can easily handle data streaming to a Hadoop or Spark task.

Writing to ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/arch.html#arch-writing. Write performance can be increased by having more partitions:

elasticsearch-hadoop detects the number of (primary) shards where the write will occur and distributes the writes between these. The more splits/partitions available, the more mappers/reducers can write data in parallel to Elasticsearch.
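The shard-to-partition mapping is easy to check from Spark itself. A minimal PySpark sketch, assuming the elasticsearch-spark connector is on the classpath; the node address and the index names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-parallelism").getOrCreate()

# Reading: the connector creates one Spark partition per primary shard.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")   # placeholder address
      .load("myindex"))                       # placeholder index name
print(df.rdd.getNumPartitions())              # == number of primary shards

# Writing: more partitions means more tasks streaming to ES in parallel.
(df.repartition(16)
   .write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost:9200")
   .mode("append")
   .save("myindex-copy"))                     # placeholder target index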

dfdeshom / gist:6277295
Last active December 21, 2015 08:19
SolrCloud: The good and the bad

We're coming up on a year since SolrCloud (Solr 4.0) was released. The company I work for recently switched to Solr 4.3, and the overall impression has been good, although there have been some growing pains. What follows are my impressions of what I've liked and not liked so far about SolrCloud.

The Bad

You can still run Solr in "non-cloud" mode. This means there are two code paths in the lucene-solr repo. It also means that support questions can get a little more complicated. There are some issues that come up because of this separation:

  • Configuration is somewhat in flux. The solr.xml file is scheduled for a major change in Solr 5 (http://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond) and might disappear completely. schema.xml and solrconfig.xml now live in ZooKeeper.

  • There seems to be some confusion between the cores API and the collections API. The collections API is a nice superset of the cores API, but some people think the two can be used interchangeably; they can't (see the sketch below).
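To make the distinction concrete, here is a hedged sketch against the stock Solr HTTP endpoints; the host, collection, and core names are placeholders, and any HTTP client would do:

import requests

SOLR = "http://localhost:8983/solr"  # placeholder node address

# Collections API: cluster-aware. Creates shards and replicas across the
# whole SolrCloud cluster and records the layout in ZooKeeper.
requests.get(SOLR + "/admin/collections", params={
    "action": "CREATE", "name": "mycollection",
    "numShards": 2, "replicationFactor": 1,
})

# Cores API: node-local. Creates a core on this one node only; in cloud
# mode the rest of the cluster won't know how to route documents to it.
requests.get(SOLR + "/admin/cores", params={
    "action": "CREATE", "name": "mycore", "instanceDir": "mycore",
})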

import json
import redis
import tldextract

def get_spider_info_for_url(url):
    h = redis.Redis()
    extracted = tldextract.extract(url)
    # NB: newer tldextract versions call this attribute `suffix`, not `tld`.
    wholedomain = ".".join([extracted.domain, extracted.tld])
    wholesubdomain = ".".join([extracted.subdomain, extracted.domain, extracted.tld])
    # Look for the full subdomain first, then fall back to the bare domain
    # (the fallback and return are assumed; the original gist is truncated here).
    info = h.hget('spider_info', wholesubdomain)
    if info is None:
        info = h.hget('spider_info', wholedomain)
    return info
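Hypothetical usage, assuming a local Redis whose spider_info hash has already been populated:

print(get_spider_info_for_url("http://blog.example.co.uk/some/post"))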
dfdeshom / gist:654542
Created October 29, 2010 22:11
diff (curl - superfeedr)
8,41d7
< <image> <url>http://www.examiner.com/sites/all/themes/base/images/logo.gif</url>
< <title>examiner.com</title>
< <link>http://www.examiner.com</link>
< <description>Examiner.com is the insider source for everything local.</description>
< <width>126</width>
< <height>29</height>
< </image>
< <item>
< <title>Paranormal Activity 2: If it&#039;s not a burglar, it must be a ghost.</title>
dfdeshom / gist:654540
Created October 29, 2010 22:09
curl crawl
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.examiner.com" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Examiner San Jose Edition Articles</title>
<link>http://www.examiner.com/rss/san-jose</link>
<description>Latest News and Articles from Examiner.com</description>
<language>en</language>
<image> <url>http://www.examiner.com/sites/all/themes/base/images/logo.gif</url>
<title>examiner.com</title>
<link>http://www.examiner.com</link>
dfdeshom / gist:654534
Created October 29, 2010 22:07
superfeedr crawl
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.examiner.com" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Examiner San Jose Edition Articles</title>
<link>http://www.examiner.com/rss/san-jose</link>
<description>Latest News and Articles from Examiner.com</description>
<language>en</language>
<item>
<title>De Anza College women&#039;s soccer team defeats Cabrillo College 2-0</title>
<link>http://www.examiner.com/sports-photography-in-san-jose/de-anza-college-women-s-soccer-team-defeats-cabrillo-college-2-0</link>
$ curl -vv http://www.examiner.com/rss/recent/santa-ana > /dev/null
* About to connect() to www.examiner.com port 80 (#0)
* Trying 67.220.220.55... connected
* Connected to www.examiner.com (67.220.220.55) port 80 (#0)
> GET /rss/recent/santa-ana HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
> Host: www.examiner.com
> Accept: */*
>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
<script type="text/javascript" src="http://static.parse.ly/p.js?apikey=examiner.com"></script>
<div id="parsely_container" style="display: none">
<a href="http://parse.ly/p3">Personalization by the Parse.ly Publisher Platform (P3)</a>
</div>
diff --git a/celerymonitor/handlers/api.py b/celerymonitor/handlers/api.py
index 7afa6d6..e86c614 100644
--- a/celerymonitor/handlers/api.py
+++ b/celerymonitor/handlers/api.py
@@ -12,6 +12,13 @@ def JSON(fun):
@wraps(fun)
def _write_json(self, *args, **kwargs):
content = fun(self, *args, **kwargs)
+ def _any(data):
+ if type(data) == type({}):
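The hunk above is cut off. As a rough illustration only, here is a hedged sketch of the pattern the patch appears to introduce: a JSON decorator whose nested _any helper recursively converts a handler's return value before serialization. The conversion rules and the Tornado-style set_header/write calls are assumptions, not the actual patch:

import json
from functools import wraps

def JSON(fun):
    """Write a handler's return value out as a JSON response."""
    @wraps(fun)
    def _write_json(self, *args, **kwargs):
        content = fun(self, *args, **kwargs)

        def _any(data):
            # Recursively normalize containers; coerce everything else
            # to str so json.dumps never chokes (rules assumed).
            if isinstance(data, dict):
                return dict((k, _any(v)) for k, v in data.items())
            if isinstance(data, (list, tuple)):
                return [_any(item) for item in data]
            if isinstance(data, (int, float, bool)) or data is None:
                return data
            return str(data)

        self.set_header("Content-Type", "application/json")
        self.write(json.dumps(_any(content)))
    return _write_json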