Antonio Piccolboni piccolbo

piccolbo / My atom data science bundle.md
Last active Aug 24, 2019
A list of atom extensions I use for data science.

Code execution

  • Hydrogen: lets you send any code selection to a kernel and view the results inline, plus watch expressions and more
  • hydrogen-launcher: launch a terminal or IPython

The git bundle

Some of this may be superseded by Atom's native Git integration:

  • git-blame: find who wrote that cryptic code
piccolbo / pypi-release-checklist2.md
Last active Jul 1, 2020 — forked from audreyfeldroy/pypi-release-checklist2.md
My PyPI Release Checklist 2 (now with bumpversion)
  • merge any development branch you need to merge
  • git checkout master
  • run the tests:
make install-dev
make test
  • when tests pass, git push
  • Check Travis for problems (added here because local tests sometimes fail to catch dependency issues)
  • Update HISTORY.rst
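The checklist above boils down to a short command sequence. A minimal sketch as a shell function (the `release` name is made up; the make targets and the master branch come from the checklist; pass `echo` as the argument to dry-run):

```shell
# release: run the checklist steps in order, stopping at the first failure.
# Pass "echo" as the first argument to print the commands instead of running them.
release() {
  runner="$1"   # empty to execute, "echo" to dry-run
  $runner git checkout master || return 1
  $runner make install-dev    || return 1
  $runner make test           || return 1
  $runner git push            || return 1  # then check Travis before updating HISTORY.rst
}

release echo   # dry-run: prints the four commands in order
```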
@piccolbo
piccolbo / keybase.md
Created Oct 15, 2017
keybase identification

Keybase proof

I hereby claim:

  • I am piccolbo on github.
  • I am piccolbo (https://keybase.io/piccolbo) on keybase.
  • I have a public key ASATO-Kj3cWENOHAPB5OgNFMlc4xEUtScX1L0-Er8tYX-Ao

To claim this, I am signing this object:

piccolbo / emr_spark_thrift_on_yarn
# on the cluster
thrift /spark/sbin/start-thriftserver.sh --master yarn-client
# ssh tunnel: forward unused local port 8157 to the Thrift server's port 10000
ssh -i ~/caserta-1.pem -N -L 8157:ec2-54-221-27-21.compute-1.amazonaws.com:10000 hadoop@ec2-54-221-27-21.compute-1.amazonaws.com
# for JDBC config on the client, see http://blogs.aws.amazon.com/bigdata/post/TxT7CJ0E7CRX88/Using-Amazon-EMR-with-SQL-Workbench-and-other-BI-Tools
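The tunnel line follows ssh's local-forward pattern (`-L local_port:target_host:target_port`). A tiny helper, handy when the EC2 hostname changes (the `tunnel_cmd` name and the key/host values below are hypothetical; it only prints the command):

```shell
# tunnel_cmd KEY LOCAL_PORT TARGET_HOST TARGET_PORT SSH_DESTINATION
# Prints the ssh command that listens on LOCAL_PORT locally and forwards
# connections to TARGET_HOST:TARGET_PORT through SSH_DESTINATION (-N: tunnel only, no shell).
tunnel_cmd() {
  echo "ssh -i $1 -N -L $2:$3:$4 $5"
}

tunnel_cmd key.pem 8157 ec2-host.compute-1.amazonaws.com 10000 hadoop@ec2-host.compute-1.amazonaws.com
```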
piccolbo / Fully-tele-businesses.md
Last active May 2, 2016
Fully remote business

Criteria: most employees remote all the time. No office space allocated or available. No relocation required. Low travel requirements.

| Company | Notes |
| --- | --- |
| Automattic | 4 weeks/year travel, all employees |
| Rocana | |
| RStudio | |
| Plex | |
| Open Knowledge Foundation | Time-zone restricted -3:1; different contracts UK/non-UK |
piccolbo / dplyr-backends.md
Last active Jun 23, 2018
Dplyr backends: the ultimate collection

Dplyr is a well-known R package for working on structured data, whether in memory, in a database or, more recently, on a cluster. The in-memory implementations generally have capabilities not found in the others, so the notion of a backend is used with a bit of poetic license, and even the different DB and cluster backends differ in subtle ways. But it sure beats writing SQL directly! Here I provide a list of backends, with links to the packages that implement them where necessary. I've done my best to link to active projects, but I am not endorsing any of them: do your own testing. Enjoy, and please contribute corrections or additions in the comments.

| Backend | Package |
| --- | --- |
| data.frame | builtin |
| data.table | builtin |
| arrays | builtin |
| SQLite | builtin |
| PostgreSQL/Redshift | builtin |
piccolbo / names scope.Rmd
```{r}
ff = function(){}
names(ff) = "abc"
# Error in names(ff) = "abc" : names() applied to a non-vector
is.vector(mtcars)
# [1] FALSE -- attributes beyond names make is.vector() return FALSE
names(mtcars) = LETTERS[1:11]  # yet names<- still works: a data frame is a list underneath
names(mtcars)
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
```
piccolbo / release process.md
  • merge into master
  • update version #
  • update date
  • update Rd help()
  • push master
  • Repeat until tests pass
    • test local and debug
    • test remote and debug
    • test additional platforms
    • apply necessary fixes
piccolbo / vecgroup.md
Last active Aug 29, 2015
Vectorized grouped ops in plyrmr

The goal is to expose the vectorized-group feature of rmr2 in a plyrmr way.

What

  1. Operations should encapsulate the knowledge of whether they can handle multiple groups. vectorized.reduce should be set accordingly.
  2. vectorized.reduce should be propagated along a pipe when possible. Rules TBD
  3. A repertoire of vectorized reduce ops should be made available, and adding more should be easy (no C++)
  4. Wordcount is our guiding app here.

How

piccolbo / gist:58a69cdc80fb8e4f6dc7
Last active Aug 29, 2015
Problems using R serialization to communicate with MR or Spark
  • Slow, even at the C level and for small objects; non-vectorized.
  • The serialized representation is sensitive to changes that should not affect key equality or grouping, such as the order of attributes, or even attributes like row names, which cannot be removed.
  • The serialized representation does not preserve the order of the represented items. This has been the source of some of the worst bugs in rmr, particularly one whereby groups were incorrectly split.
  • Features that require the Java side to understand the field structure, such as joins, are lost. They can be re-implemented in R at the cost of speed, duplication of effort, inconsistency, etc. Having a nice type mapping between languages is almost always an advantage; the only problem is that the mapping is difficult. Mapping everything in R to bytes in Java is an admission of defeat.