Mamonu (mamonu)

@mamonu
mamonu / browncorpuswordcount.py
Last active Apr 9, 2020
brown corpus word count
import nltk
import string
# nltk.download('brown')
# if nltk hasn't been used before, this will download the Brown corpus
from nltk.corpus import brown
from collections import Counter
import pandas as pd
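The preview stops at the imports; a minimal sketch of how the counting might continue (the Counter/DataFrame lines below are my assumption, not the rest of the original gist):

words = [w.lower() for w in brown.words() if w not in string.punctuation]  # drop punctuation tokens
counts = Counter(words)
# put the most frequent words into a DataFrame for easy inspection
top_words = pd.DataFrame(counts.most_common(20), columns=["word", "count"])
print(top_words)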
@mamonu
mamonu / data.csv
Created Feb 7, 2020
simple data for graphs
NodeA,NodeB,similarity
Theodore,Theodoras,0.9
Theodore,Sam,0.0
Samuel,Sam,0.7
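The file is a small three-column similarity table; a hedged sketch of loading it and building a weighted graph (pandas plus networkx are my choice here, not something the gist itself specifies):

import pandas as pd
import networkx as nx

df = pd.read_csv("data.csv")  # columns: NodeA, NodeB, similarity
G = nx.Graph()
for _, row in df.iterrows():
    # keep only pairs with non-zero similarity as weighted edges
    if row["similarity"] > 0:
        G.add_edge(row["NodeA"], row["NodeB"], weight=row["similarity"])
print(G.edges(data=True))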
@mamonu
mamonu / python-django-postgres-ci.yml
Created Nov 17, 2019 — forked from jefftriplett/python-django-postgres-ci.yml
This is a good starting point for getting Python, Django, Postgres running as a service, pytest, black, and pip caching rolling with GitHub Actions.
name: CI
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    services:
@mamonu
mamonu / SparkUI.md

I will have to think about a sensible place to put this.
But here is how you can get the Spark UI for a Glue job, by passing the '--enable-spark-ui' and '--spark-event-logs-path' job arguments:

job = GlueJob('my_dir/', bucket=bucket, job_role=my_role,
              job_arguments={"--test_arg": 'some_string',
                             '--enable-spark-ui': 'true',
                             '--spark-event-logs-path': 's3://alpha-data-linking/glue_test_delete/logsdelete' })
@mamonu
mamonu / AWK.txt
Created Jun 26, 2019
HANDY ONE-LINE SCRIPTS FOR AWK
HANDY ONE-LINE SCRIPTS FOR AWK 30 April 2008
Compiled by Eric Pement - eric [at] pement.org version 0.27
Latest version of this file (in English) is usually at:
http://www.pement.org/awk/awk1line.txt
USAGE:
   Unix:  awk '/pattern/ {print "$1"}'    # standard Unix shells
DOS/Win:  awk '/pattern/ {print "$1"}'    # compiled with DJGPP, Cygwin
@mamonu
mamonu / exampleproject.md
Created May 15, 2019
contains an example data science project and setup guide

example project

contains an example data science project and setup guide

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
@mamonu
mamonu / ordinalexample.R
Last active Apr 19, 2019
ordinal example
library(datasets)
library(sjmisc)
library(sjPlot)
library(ordinal)
set.seed(111)
# use the airquality dataset, treating Month as an ordinal outcome
ab <- datasets::airquality
ab$orddep <- as.factor(ab$Month)
# fit a cumulative link model: ordinal outcome regressed on temperature
ord.1 <- clm(orddep ~ Temp, data = ab)
@mamonu
mamonu / TechnicalReportforthegraphdatabasesproject.md
Created Nov 20, 2018
Technical Report for the graph databases project

Technical Report for the graph databases project

In order to create the proof-of-concept graph databases project, a wide range of relevant technologies has been used. These include the Neo4j graph database, where the transactional processing (INSERTs, UPDATEs, etc.) is handled; the igraph API, where the analytical processing (graph calculations) is handled; and the R programming language together with various R libraries used for things like providing a web interface (Shiny), interactive visualizations (networkD3), interactive tables (DT, formattable), and access to the Neo4j

@mamonu
mamonu / pysparkfixtureexample.py
Created Oct 29, 2018
pyspark fixture example
@pytest.fixture(scope="session")
def spark_context(request):
""" fixture for creating a spark context
Args:
request: pytest.FixtureRequest object
"""
conf = (SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing"))
sc = SparkContext(conf=conf)
request.addfinalizer(lambda: sc.stop())
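A test using this fixture might look like the sketch below (the test body is illustrative and not part of the gist):

def test_word_count(spark_context):
    # parallelize a tiny dataset and run a simple map/reduce over it
    rdd = spark_context.parallelize(["a", "b", "a"])
    counts = dict(rdd.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y).collect())
    assert counts == {"a": 2, "b": 1}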
@mamonu
mamonu / monoids-and-reductions.md
Last active May 2, 2018 — forked from ludflu/monoids-and-reductions.md
Monoids and map-side reductions using Spark's aggregateByKey

In a classic Hadoop job, you've got mappers and reducers. The "things" being mapped and reduced are key-value pairs for some arbitrary pair of types. Most of your parallelism comes from the mappers, since they can (ideally) split the data and transform it without any coordination with other processes.

By contrast, the amount of parallelism in the reduction phase has an important limitation: although you may have many reducers, any given reducer is guaranteed to receive all the values for some particular key.

So if there are a HUGE number of values for some particular key, you're going to have a bottleneck because they're all going to be processed by a single reducer.

However, there is another way! Certain types of data fit into a pattern:

  • they can be combined with other values of the same type to form new values.
  • the combining operation is associative. For example, integer addition: ((1 + 2) + 3) == (1 + (2 + 3)).
  • they have an identity value (for integer addition, that's 0).
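A small PySpark sketch of the idea (my own illustration, not taken from the forked gist): aggregateByKey takes the identity ("zero") value plus associative combine functions, so partial sums are computed map-side within each partition before anything is shuffled to the reducers.

from pyspark import SparkContext

sc = SparkContext("local[2]", "monoid-example")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("a", 4)], numSlices=2)
# zeroValue = the monoid identity (0 for integer addition);
# seqFunc combines values within a partition (map-side),
# combFunc merges the per-partition partial results.
sums = pairs.aggregateByKey(0, lambda acc, v: acc + v, lambda x, y: x + y)
print(sums.collect())  # [('a', 8), ('b', 2)] in some order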