
Daniel Rodriguez (danielfrg)

@danielfrg
danielfrg / jython-pig.job
Last active December 26, 2015 22:08
Example of how to run a Jython UDF in AWS EMR. The example loads a list of URLs, queries each URL, and saves the output. Pig version: 0.11.
REGISTER utils.py USING jython AS utils;
urls = LOAD 'INPUT_FILE' USING PigStorage('\t') AS (url:chararray);
query = FOREACH urls GENERATE utils.query(url) AS everything;
file = FOREACH query GENERATE FLATTEN(everything);
STORE file INTO 's3n://OUTPUT_DIR' USING PigStorage('\t');
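The script registers utils.py as a Jython UDF and calls utils.query(url), flattening whatever it returns. The original utils.py is not shown here; a minimal sketch of what such a UDF could look like, assuming it fetches each URL with urllib2 and returns a tuple so that FLATTEN applies:
# utils.py -- hypothetical sketch, not the original gist's implementation
import urllib2

# Pig's Jython script engine injects the outputSchema decorator for UDFs
@outputSchema("t:(url:chararray, response:chararray)")
def query(url):
    try:
        return (url, urllib2.urlopen(url, timeout=10).read())
    except Exception:
        return (url, None)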
@danielfrg
danielfrg / nutch-to-tdf.py
Created November 6, 2013 16:27
Parse Nutch segments into a TDF. Light on memory: only one line is loaded at a time and one HTML document is stored at a time; on the other hand, it is more I/O intensive. _input is the dumped HTML content from Nutch; _output is the TDF that is generated. Requirements: pandas.
import pandas as pd
_input = 'dump0'
_output = 'html0.tdf'
df = pd.DataFrame({'url': [], 'html': []})
df.to_csv(_output, sep='\t', index=None)
def append_tdf(urls, html):
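The preview cuts off at append_tdf. A plausible completion, consistent with the description above (rows are appended as they are parsed so only one HTML document is held in memory); the exact escaping the original applied is unknown:
def append_tdf(urls, html):
    # Hypothetical completion: append the given (url, html) rows to the TDF
    # in append mode, skipping the header that was already written above.
    rows = pd.DataFrame({'url': urls, 'html': html})
    rows.to_csv(_output, sep='\t', index=False, header=False, mode='a')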
@danielfrg
danielfrg / merge-files-hdfs-count-pipeline.py
Last active October 15, 2018 16:18
Luigi pipeline: 1. Reads a bunch of TDF files from local storage and creates a big JSON file in HDFS. 2. Uses a Hadoop MR job to count the number of words (this is actually a field on each JSON object).
import json
import luigi
import luigi.hdfs
import luigi.hadoop
import pandas as pd
import numpy

# Ship numpy and pandas to the Hadoop worker nodes along with the job
luigi.hadoop.attach(numpy, pd)
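The second step is described as a Hadoop MR job that counts words, where the words are a field on each JSON object. Building on the imports above, a minimal sketch of how such a job could be written with luigi.hadoop; the task name, field name, and output path are hypothetical, not the original gist's:
class CountWords(luigi.hadoop.JobTask):
    """Hypothetical MR job: count the 'words' field across the JSON lines in HDFS."""

    def requires(self):
        return MergeFilesToHdfs()  # hypothetical upstream task that writes the big JSON file

    def output(self):
        return luigi.hdfs.HdfsTarget('/data/word_counts.tsv')  # hypothetical output path

    def mapper(self, line):
        record = json.loads(line)
        for word in record.get('words', []):
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)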
@danielfrg
danielfrg / clean-html-solr-pipeline.py
Last active May 6, 2019 12:45
Luigi pipeline that: 1. Reads a TDF file using pandas with HTML in the 'content' column and creates another TDF with just the text of the HTML (BeautifulSoup). 2. Indexes the text into a Solr collection using mysolr.
import re
import json
import luigi
import pandas as pd
from mysolr import Solr
from bs4 import BeautifulSoup
class InputText(luigi.ExternalTask):
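The preview stops at the InputText task. A minimal sketch of the HTML-to-text step the description refers to, using the BeautifulSoup import above; the whitespace cleanup and column names are assumptions:
def html_to_text(html):
    # Strip tags with BeautifulSoup and collapse whitespace;
    # the 'content' column is assumed to hold the raw HTML.
    text = BeautifulSoup(html).get_text(separator=' ')
    return re.sub(r'\s+', ' ', text).strip()

# e.g. df['text'] = df['content'].apply(html_to_text)
# Step 2 (not shown here) would push the cleaned rows to Solr via the mysolr client imported above.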

Keybase proof

I hereby claim:

  • I am danielfrg on github.
  • I am danielfrg (https://keybase.io/danielfrg) on keybase.
  • I have a public key ASDYKve9COIyFov3ozEHC6eHuRZFZqPQq8b1ezthy4hNVgo

To claim this, I am signing this object:

This file has been truncated.
<!DOCTYPE html>
<html>
<head><meta charset="utf-8" />
<title>matplotlib</title><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.1.10/require.min.js"></script><link rel="stylesheet" href="https://unpkg.com/font-awesome@4.7.0/css/font-awesome.min.css" type="text/css" />
<style type="text/css">
@danielfrg
danielfrg / set_campaign_goal.py
Last active January 24, 2024 17:40
Google Ads API set campaign goal
#!/usr/bin/env python
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software