M. Farrajota farrajota

## data-structures.md

      
Last active May 3, 2021 07:57

Efficient data structures in Python

Here I've collected the most important data structures I could find, each with the most efficient Python implementation I am aware of.
This is a work in progress, and many structures do not have an implementation reference yet.
If you happen to know a more efficient way to implement one or more of these structures in Python (or any language), feel
free to ping me :).

Basic Data Structures

| Data Structure | Python |
| --- | --- |
| Arrays | numpy, Tensorflow, PyTorch |


## s3_util.py
"""
Using dask's multithreaded scheduler to speedup download of multiple files from
an s3 bucket
"""

import os
from functools import partial

import botocore
import boto3

## apache-kafka-productivity-hacks.md

      
Created September 30, 2019 16:05, forked from MichaelDrogalis/apache-kafka-productivity-hacks.md

I've been working with Apache Kafka for over 7 years. I inevitably find myself doing the same set of activities while I'm developing or working with someone else's system. Here's a set of Kafka productivity hacks for doing a few things way faster than you're probably doing them now. 🔥

Show me all my Kafka topics and their partitions, replicas, and consumers
Show me the contents of a topic
Create a Kafka topic
Produce messages to a Kafka topic
Validate the schema of messages before producing to a topic
Do all of this at a distance

Get the tools


## missing_value_imputation.md

      
Created October 25, 2018 13:32

How to treat missing values in your data

Missing values in data

Types of missing values

Missing completely at random (MCAR)

MCAR exists when missing values are randomly distributed across all observations.
Missingness in a given variable does not depend on any other variable, whether observed or unobserved.
MCAR can be confirmed by dividing respondents into those with and without missing data, then using t-tests
of mean differences on income, age, gender, and other key variables to establish that the two groups do not

  
## imputation_rules.md

      
Last active December 25, 2021 11:25

Rules of thumb for when imputation of missing values should not be used.

When imputation should not be used


If data are MCAR, imputation may not be needed.
If missingness is due to unmeasured variables related to the dependent variable, data are MNAR and should not be imputed.
Imputation assumes data are MAR and should not be used with sparse data. Sparse data occur when missingness is non-random, such as a shopping cart survey of items purchased (coded 1) or not purchased (coded 0), because the null response (0) is non-random, due to unmeasured factors possibly not even known to the shopper.
Imputation should not be used to impute all the data for a subject.
Imputation should not be used for a missing value for a given observation if that observation is also missing values on predictively critical variables in the imputation model. While this is difficult to check for each value to be imputed, a table of missing value patterns will show how many cases missing on a given variable also have missing values on other variables. In some cases this may lead a researcher
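
The table of missing value patterns mentioned in the last rule can be sketched in a few lines. The example below uses hypothetical records and column names (assumptions for illustration only), labels each row with the set of columns that are missing, and counts how often each pattern occurs, so that rows missing several variables at once stand out.

```python
from collections import Counter


def missing_pattern(row, columns):
    """Return the tuple of column names whose value is missing (None) in `row`."""
    return tuple(col for col in columns if row.get(col) is None)


def pattern_table(rows, columns):
    """Count how many rows share each missing-value pattern."""
    return Counter(missing_pattern(row, columns) for row in rows)


# Hypothetical records; None marks a missing value.
columns = ["age", "income", "education"]
rows = [
    {"age": 34, "income": 52000, "education": "BSc"},
    {"age": 41, "income": None, "education": "MSc"},
    {"age": None, "income": None, "education": "BSc"},
    {"age": 29, "income": None, "education": None},
    {"age": 55, "income": 61000, "education": None},
    {"age": 47, "income": None, "education": "PhD"},
]

table = pattern_table(rows, columns)
for pattern, count in table.most_common():
    label = ", ".join(pattern) if pattern else "(complete)"
    print(f"{count} row(s) missing: {label}")

# Rows missing income that are also missing at least one other variable.
income_plus_other = sum(
    count for pattern, count in table.items()
    if "income" in pattern and len(pattern) > 1
)
print("income missing together with another variable:", income_plus_other)
```

Rows counted in `income_plus_other` are the ones this rule warns about when the other missing variables are important predictors in the imputation model.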


## checklist.md

      
Last active June 28, 2023 20:41

Data science process with checklists

data science checklist


Step 1

Data loading

  
## guideline.md

      
Last active July 3, 2018 15:49

Data cleaning guidelines for multivariate data exploration (using Python's scipy stack).

Missing data

A Simple Example of a Missing Data Analysis
Understanding the Reasons Leading to Missing Data
Ignorable Missing Data
Other Types of Missing Data Processes
Examining the Patterns of Missing Data
Diagnosing the Randomness of the Missing Data Process
Rules of Thumb


## git-cheatsheet.md

      
Created August 30, 2017 22:51, forked from eashish93/git-cheatsheet.md

My Git Cheatsheet

vs Questions


http://stackoverflow.com/questions/804115 (rebase vs merge)
https://www.atlassian.com/git/tutorials/merging-vs-rebasing (rebase vs merge)
https://www.atlassian.com/git/tutorials/undoing-changes/ (reset vs checkout vs revert)
http://stackoverflow.com/questions/2221658 (HEAD^ vs HEAD~) (See git rev-parse)
http://stackoverflow.com/questions/292357 (pull vs fetch)
http://stackoverflow.com/questions/39651 (stash vs branch)
http://stackoverflow.com/questions/8358035 (reset vs checkout vs revert)
http://stackoverflow.com/questions/5798930 (git reset vs git rm --cached)


## bobp-python.md

      
Created May 31, 2017 14:10, forked from sloria/bobp-python.md

A "Best of the Best Practices" (BOBP) guide to developing in Python.

The Best of the Best Practices (BOBP) Guide for Python

A "Best of the Best Practices" (BOBP) guide to developing in Python.
In General

Values


"Build tools for others that you want to be built for you." - Kenneth Reitz
"Simplicity is always better than functionality." - Pieter Hintjens


## mysplittable.lua
local SplitTable, parent = torch.class('nn.MySplitTable', 'nn.Module')

function SplitTable:__init(dimension, nTensors)
   parent.__init(self)
   self.dimension = dimension
   self.nTensors = nTensors
   self.joinTable = nn.JoinTable(dimension)
end

function SplitTable:getSize(input)