Daniel Rodriguez (danielfrg)

danielfrg / set_campaign_goal.py
Last active January 24, 2024 17:40
Google Ads API set campaign goal
#!/usr/bin/env python
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
Keybase proof

I hereby claim:

  • I am danielfrg on github.
  • I am danielfrg (https://keybase.io/danielfrg) on keybase.
  • I have a public key ASDYKve9COIyFov3ozEHC6eHuRZFZqPQq8b1ezthy4hNVgo

To claim this, I am signing this object:

danielfrg / nutch-to-tdf.py
Created November 6, 2013 16:27
Parse Nutch segments into a TDF (tab-delimited file). Light on memory: only one line is loaded and only one HTML document is held at a time; the trade-off is that it is more I/O intensive. _input is the HTML content dumped from Nutch; _output is the TDF that is generated. Requires: pandas
import pandas as pd
_input = 'dump0'
_output = 'html0.tdf'
df = pd.DataFrame({'url': [], 'html': []})
df.to_csv(_output, sep='\t', index=None)
def append_tdf(urls, html):
danielfrg / jython-pig.job
Last active December 26, 2015 22:08
Example of how to run a Jython UDF on AWS EMR. The example loads a list of URLs, queries each URL, and saves the output. Pig version: 0.11
REGISTER 'utils.py' USING jython AS utils;
urls = LOAD 'INPUT_FILE' USING PigStorage('\t') AS (url:chararray);
query = FOREACH urls GENERATE utils.query(url) AS everything;
file = FOREACH query GENERATE FLATTEN(everything);
STORE file INTO 's3n://OUTPUT_DIR' USING PigStorage('\t');
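The utils.py module registered above is not shown. A minimal sketch of what it might contain, assuming query returns the URL and a slice of the response body as a tab-separated chararray; the pig_util fallback and the injectable fetch parameter are added here purely so the module can be exercised outside Pig:

```python
# Hypothetical utils.py for the Pig script above. Under Pig, outputSchema
# comes from pig_util; a no-op stand-in is defined for local runs.
try:
    from pig_util import outputSchema  # available when executed by Pig
except ImportError:
    def outputSchema(schema):
        def wrap(fn):
            return fn
        return wrap

from urllib.request import urlopen  # Jython 2.x would use urllib2 instead


@outputSchema('everything:chararray')
def query(url, fetch=None):
    """Fetch `url` and return 'url<TAB>first-100-chars-of-body'.

    `fetch` can be injected for testing; by default it does a real HTTP GET.
    """
    fetch = fetch or (lambda u: urlopen(u).read().decode('utf-8', 'replace'))
    try:
        body = fetch(url)
    except Exception:
        body = ''  # failed fetches still emit a row so FLATTEN keeps the URL
    return '%s\t%s' % (url, body[:100])
```

The FLATTEN in the Pig script then splits that tab-separated string back into columns when the result is stored.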
danielfrg / merge-files-hdfs-count-pipeline.py
Last active October 15, 2018 16:18
Luigi pipeline: 1. Reads a set of TDF files from local storage and creates one big JSON file in HDFS. 2. Uses a Hadoop MapReduce job to count the number of words (taken from a field on each JSON object).
import json
import luigi
import luigi.hdfs
import luigi.hadoop
import pandas as pd
import numpy
import pandas
luigi.hadoop.attach(numpy, pandas)
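The preview stops after the attach call. A minimal sketch of the mapper/reducer pair such a luigi.hadoop JobTask would define for step 2, assuming each HDFS line is a JSON object and that the counted words live in a hypothetical 'content' field (the gist does not show the field name):

```python
import json


# Hypothetical word-count mapper/reducer; the 'content' field name is an
# assumption. Inside a luigi.hadoop.JobTask these would be methods
# (def mapper(self, line)); plain functions are used here so they can be
# exercised without a Hadoop cluster.
def mapper(line):
    record = json.loads(line)
    for word in record.get('content', '').split():
        yield word, 1  # emit (word, 1) for every token in the field


def reducer(word, counts):
    yield word, sum(counts)  # Hadoop groups by key, so summing finishes the count
```

luigi.hadoop.attach(numpy, pandas) in the snippet ships those packages to the task nodes so the mapper can import them there.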
danielfrg / clean-html-solr-pipeline.py
Last active May 6, 2019 12:45
Luigi pipeline that: 1. Reads a TDF file using pandas with HTML in the 'content' column and creates another TDF with just the text of the HTML (extracted with BeautifulSoup). 2. Indexes the text into a Solr collection using mysolr.
import re
import json
import luigi
import pandas as pd
from mysolr import Solr
from bs4 import BeautifulSoup
class InputText(luigi.ExternalTask):
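The preview ends at the task declarations. A stand-alone sketch of the HTML-to-text step at the heart of stage 1, written against the stdlib HTMLParser instead of BeautifulSoup so it runs with no dependencies (the gist itself would typically reach for BeautifulSoup's get_text):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # >0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    # Collapse runs of whitespace left behind by removed markup.
    return ' '.join(' '.join(parser.parts).split())
```

In the pipeline this function would be applied to each value of the 'content' column before the cleaned TDF is written and handed to the mysolr indexing task.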