javrasya /
Last active March 4, 2023 14:33
Downloads the Wikipedia page views data set and uploads it to S3 concurrently, as each file is downloaded. The level of concurrency can be configured with semaphores.
import asyncio
import zlib
from typing import List, Tuple
from aiobotocore.session import AioSession
from aiohttp_retry import ExponentialRetry, RetryClient
from tqdm import tqdm
# ##### PARAMETERIZED PART #######
YEAR = 2015
çalıştır = 1
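The description above says that concurrency is limited with semaphores. Here is a minimal sketch of that idea using only the standard library. It is not the gist's actual code: the `transfer` coroutine, the `MAX_CONCURRENCY` value, and the file names are hypothetical, and `asyncio.sleep` stands in for the real download and S3 upload.

```python
import asyncio
from typing import List

MAX_CONCURRENCY = 3  # hypothetical limit, not a value taken from the gist


async def transfer(name: str, semaphore: asyncio.Semaphore, stats: dict) -> str:
    """Stand-in for downloading one file and uploading it to S3."""
    async with semaphore:  # at most `limit` tasks are inside this block at once
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # simulated network I/O
        stats["running"] -= 1
    return name


async def transfer_all(names: List[str], limit: int = MAX_CONCURRENCY) -> dict:
    """Start one task per file, but let only `limit` of them run concurrently."""
    semaphore = asyncio.Semaphore(limit)
    stats = {"running": 0, "peak": 0, "done": []}
    stats["done"] = await asyncio.gather(
        *(transfer(name, semaphore, stats) for name in names)
    )
    return stats


if __name__ == "__main__":
    result = asyncio.run(transfer_all([f"file-{i}" for i in range(10)]))
    print(result["peak"], len(result["done"]))
```

All tasks are created up front, and the semaphore decides how many of them do work at any moment, so changing the limit is a one-line change.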
package dal.ahmet.hive.unittest;
import com.klarna.hiverunner.HiveShell;
import com.klarna.hiverunner.StandaloneHiveRunner;
import com.klarna.hiverunner.annotations.HiveSQL;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
View student_count_report.sql
-- execute_student_count_report.hql
use mydatabase;
INSERT INTO TABLE student_count_report
SELECT
  count(student.student_id) as cnt
FROM school
LEFT JOIN student on student.school_id = school.school_id

Pip is the package manager for Python. It can install libraries from Python package indexes such as PyPI, and it can also install them directly from a git repository. Installing from git is the subject of this article.
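As a rough illustration of what installing from a git repository looks like, pip accepts VCS requirement strings that start with `git+https://` or `git+ssh://` and can end with an optional `@<ref>` (a branch, tag, or commit). The helper below only builds such strings. The function name `git_requirement`, the host, and the repository path are made up for this example.

```python
from typing import Optional


def git_requirement(protocol: str, host: str, path: str, ref: Optional[str] = None) -> str:
    """Build a pip VCS requirement string for a git repository.

    protocol is "https" or "ssh"; ref is an optional branch, tag, or commit.
    """
    if protocol == "https":
        url = f"git+https://{host}/{path}.git"
    elif protocol == "ssh":
        # SSH URLs log in as the "git" user and authenticate with your SSH key.
        url = f"git+ssh://git@{host}/{path}.git"
    else:
        raise ValueError(f"unsupported protocol: {protocol}")
    return f"{url}@{ref}" if ref else url


if __name__ == "__main__":
    # Hypothetical repository, used only to show the shape of the strings.
    print(git_requirement("https", "github.com", "example-user/example-repo"))
    print(git_requirement("ssh", "github.com", "example-user/example-repo", ref="main"))
```

Either string can be passed to `pip install` or placed on its own line in a requirements file.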

I don't like memorizing things, so I probably couldn't work without the internet :). Whenever I need to install a Python library from a git repository, I find many different ways to do it, and it is really confusing. That is probably why I can't memorize it. A very simple requirement ends up being handled in too many confusing ways. There shouldn't be so many, and some of them don't even work. So I finally decided to blog about it.

As you may know, you can talk to a git repository over two protocols: HTTP and SSH. Using SSH instead of HTTP can be more convenient. Because SSH authenticates with your private/public key pair, you don't have to enter your credentials every time. But I'll be

View count_letters.sql
select count_letters('name');
package dal.ahmetdal.hive.udf.lettercount;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public final class LetterCounter extends UDF {
    // Returns the number of characters in the input, or null for a null input.
    public Integer evaluate(final Text input) {
        if (input == null) return null;
        return input.toString().length();
    }
}

Apache Hive is a project that provides a SQL-like DSL, HiveQL, on top of MapReduce in the Hadoop ecosystem. Hive generates the mappers and reducers from the given SQL. It is an alternative to Apache Pig.

Hive has many built-in functions, but sometimes we need custom ones. These custom functions are called UDFs (user-defined functions).

UDFs can be written in any language that can be built into a jar. For example, a UDF written in Clojure still has to be packaged as a jar in the end.

After we build the jar file that contains the UDF code, we need to copy it to Hive's auxiliary library folder, which holds extra libraries for Hive. Hive validates and loads them, and also informs Hadoop MapReduce (YARN) about the libraries so that they are loaded there too, because our UDF code is actually invoked in the map-reduce job, not by