Ashish Dutt duttashi

## solving_smartgiterror.txt
Environment:
SmartGit version 21.1
Windows 10

Error:
unable to read askpass response from 'askpass.cmd'
could not read Username for 'https://github.com': terminal prompts disabled
also unable to launch command prompt

Solution:

## cleaning_text_data_using_regex.py
# Suppose the text data is loaded in a pandas DataFrame called df.
# Use regular expressions to clean the text data.
import re

# Remove Twitter handles
df.text = df.text.apply(lambda x: re.sub(r'@\S+', '', x))

# Remove hashtags
df.text = df.text.apply(lambda x: re.sub(r'\B#\S+', '', x))

# Remove URLs
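
The preview above stops at the URL step. The following is a minimal sketch of how the three cleaning steps might be combined into one function using only the standard library. The URL pattern, the whitespace cleanup, and the name `clean_tweet` are assumptions for illustration and are not taken from the original gist.

```python
import re

# Illustrative patterns; the URL and whitespace patterns are assumptions.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
HANDLE_RE = re.compile(r'@\S+')
HASHTAG_RE = re.compile(r'\B#\S+')
WHITESPACE_RE = re.compile(r'\s+')


def clean_tweet(text):
    """Strip URLs, handles and hashtags, then collapse leftover whitespace."""
    # URLs go first so that an '@' inside a URL is not half-removed
    # by the handle pattern.
    text = URL_RE.sub('', text)
    text = HANDLE_RE.sub('', text)
    text = HASHTAG_RE.sub('', text)
    return WHITESPACE_RE.sub(' ', text).strip()
```

With a DataFrame like the one assumed above, this could be applied as `df.text = df.text.apply(clean_tweet)`.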

## Can't push to GitHub because of large file which I already deleted.md

      
            
Created March 12, 2021 06:40

Problem statement

I did some data analysis that generated a huge data file, larger than 100 MB. I accidentally tried to push it to GitHub, and the nightmares began!
I kept getting error messages saying I can't push because large files were detected.
Solution

The following solution worked for me:

Open a command shell in the repo


## train_validate_test_split.py
# create a custom function to split data into 3 sets

import numpy as np

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
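
The function above is cut off before the actual split. Here is a minimal sketch of the same index arithmetic using only the standard library. It assumes that whatever remains after the train and validate slices becomes the test set. The name `split_indices` is hypothetical, and it shuffles row positions with `random` rather than permuting `df.index` with NumPy.

```python
import random


def split_indices(n, train_percent=0.6, validate_percent=0.2, seed=None):
    """Shuffle the positions 0..n-1 and cut them into three lists."""
    if train_percent + validate_percent > 1:
        raise ValueError("train_percent + validate_percent must not exceed 1")
    rng = random.Random(seed)          # local generator, reproducible per seed
    perm = list(range(n))
    rng.shuffle(perm)
    train_end = int(train_percent * n)
    validate_end = int(validate_percent * n) + train_end
    train = perm[:train_end]
    validate = perm[train_end:validate_end]
    test = perm[validate_end:]         # assumption: the remainder is the test set
    return train, validate, test
```

With a pandas DataFrame, the three position lists could then be passed to `df.iloc[...]` to obtain the three subsets.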

## replace_empty_level_in_factor_var.R
# load required libraries
library(tidyverse)
# READ DATA IN MEMORY
df_train<- read.csv("kaggle_fake_job_prediction/data/fake_job_postings.csv",
                    header=T, na.strings=c(" ","NA"), stringsAsFactors = FALSE, strip.white = TRUE)

# create copy
df<- df_train
# coerce character vars to factor for data cleanup
df<- df %>%

## coerce_multiple_character_vars_to_factor.R
# load required libraries
library(tidyverse)
# READ DATA IN MEMORY
df_train<- read.csv("kaggle_fake_job_prediction/data/fake_job_postings.csv",
                    header=T, na.strings=c(" ","NA"), stringsAsFactors = FALSE, strip.white = TRUE)

# create copy
df<- df_train
# coerce all character vars to factor (funs() is deprecated in dplyr)
df<- df %>%
  mutate_if(is.character, as.factor)

## two_hot_encoder_for_categorical_data.md

      
            
Created June 20, 2019 01:40

There are three options for converting categorical features to numerical ones:

1. Use OneHotEncoder. This transforms the categorical feature into four new columns, where each row contains a single 1 and 0s elsewhere. The problem here is that the difference between "morning" and "afternoon" is the same as the difference between "morning" and "evening".

2. Use OrdinalEncoder. This transforms the categorical feature into just one column: "morning" to 1, "afternoon" to 2, etc. The difference between "morning" and "afternoon" will be smaller than between "morning" and "evening", which is good, but the difference between "morning" and "night" will be the greatest, which might not be what you want.

3. Use a transformation that I call two_hot_encoder. It is similar to OneHotEncoder, except that there are two 1s in each row. The difference between "mor
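
The text is cut off before it shows the two-hot layout, so the sketch below rests on an assumption: each category sets a 1 in its own column and in the column of the next category, wrapping around at the end. The function names, the category order, and the Hamming-distance helper are all hypothetical and only serve to compare the three encodings.

```python
CATEGORIES = ["morning", "afternoon", "evening", "night"]


def one_hot(value, categories=CATEGORIES):
    """One column per category, a single 1 in the matching column."""
    return [1 if c == value else 0 for c in categories]


def ordinal(value, categories=CATEGORIES):
    """A single number: the 1-based position of the category."""
    return categories.index(value) + 1


def two_hot(value, categories=CATEGORIES):
    """Assumed layout: a 1 for the category and a 1 for the next one (cyclic)."""
    n = len(categories)
    i = categories.index(value)
    row = [0] * n
    row[i] = 1
    row[(i + 1) % n] = 1
    return row


def hamming(a, b):
    """Number of positions where two encoded rows differ."""
    return sum(x != y for x, y in zip(a, b))
```

Under this assumed layout, neighbouring times of day differ in two positions and opposite ones in four, and "night" ends up as close to "morning" as "afternoon" is, which a plain ordinal encoding cannot express.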


## ggplotRegression.R
# load required libraries
library(ggplot2)
# create a function to plot linear regression results
# adapted from https://sejohnston.com/2012/08/09/a-quick-and-easy-function-to-plot-lm-results-in-r/
ggplotRegression <- function (fit) {
  lmdf<- data.frame(fitted_values = fit$fitted.values, actual_values = fit$model[, 1])
  print(names(lmdf))
  ggplot(lmdf, aes(x = actual_values, y = fitted_values)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0) +
    labs(title = paste("Adj R2 = ", signif(summary(fit)$adj.r.squared, 4),
                       "Intercept =",signif(fit$coef[[1]],5 ),

## separate_categorical_continuous_variables.r
# Ensure the data is read as a data frame and that the categorical variables are read as factors, not characters.
# A minimal reprex is given below.

# load the adult dataset from the UCI ML repo.
library(data.table)
dt<- fread("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
           header = FALSE, sep = ",", stringsAsFactors = TRUE)
# coerce data table to data frame
dt<- as.data.frame(dt)
head(dt)

## download_data_from_url.r
# Apparently the problem lies in HTTPS: read.csv() in R failed on this URL, and RCurl's getURL() gave the same error.
# Then I tried fread() from library(data.table) and it worked.
# Below is a minimal reproducible example that downloads data from an HTTPS web page.

# load the adult dataset
library(data.table)
dt<- fread("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",header = FALSE, sep=",")

head(dt)
   V1               V2     V3        V4 V5                 V6