Takeshi Yamamuro (maropu) — gists
- Complex Embeddings for Simple Link Prediction: https://arxiv.org/abs/1606.06357
- Vertex AI Matching Engine: https://cloud.google.com/vertex-ai/docs/matching-engine
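For the ComplEx model in the first link, the score of a triple (s, r, o) is Re(⟨w_r, e_s, conj(e_o)⟩) over complex-valued embeddings. A minimal NumPy sketch of that scoring function — the dimension and the random vectors are illustrative assumptions, not trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding dimension (illustrative)

# Complex-valued embeddings for subject, object, and relation
e_s = rng.standard_normal(d) + 1j * rng.standard_normal(d)
e_o = rng.standard_normal(d) + 1j * rng.standard_normal(d)
w_r = rng.standard_normal(d) + 1j * rng.standard_normal(d)

# ComplEx score: Re(<w_r, e_s, conj(e_o)>) = Re(sum_k w_r[k] * e_s[k] * conj(e_o[k]))
score = float(np.real(np.sum(w_r * e_s * np.conj(e_o))))
```

In practice the embeddings are learned by ranking true triples above corrupted ones; the sketch only shows the trilinear scoring step.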
$ ./build/mvn clean test -DmemoryFiles=rerun.txt
$ cat
TestFailed Some(org.apache.spark.api.python.RepairSuite) org.apache.spark.api.python.RepairSuite None
TestFailed Some(org.apache.spark.api.python.DepGraphSuite) org.apache.spark.api.python.DepGraphSuite Some(computeFunctionalDepMap)
$ ./build/mvn clean test -DtestsFiles=rerun.txt
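The recorded lines can also be post-processed programmatically. Below is a hypothetical parser — the regex and function name are assumptions based only on the two `TestFailed` lines shown above:

```python
import re

def parse_rerun_line(line):
    # Matches lines of the form:
    #   "TestFailed Some(<suite>) <suite> None"
    #   "TestFailed Some(<suite>) <suite> Some(<test name>)"
    m = re.match(r"TestFailed Some\((\S+)\) (\S+) (?:None|Some\((\S+)\))", line)
    if m is None:
        return None
    # group(3) is None when the whole suite, not a single test, is recorded
    return m.group(2), m.group(3)
```

For example, the second recorded line parses to the suite name plus the single failed test, `computeFunctionalDepMap`.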
Run starting. Expected test count is: 5
DepGraphSuite:
13:29:58.598 WARN org.apache.spark.util.Utils: Your hostname, maropus-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.3.2 instead (on interface en0)
13:29:58.599 WARN org.apache.spark.util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
import time
from collections import Counter

import pytest
from pyspark.accumulators import AccumulatorParam
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

class UdfMetricAccumulatorParam(AccumulatorParam):
    def zero(self, value):
        # dict.update returns None, so the dict itself must be returned
        init_value = {}
        init_value.update(value)
        return init_value
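The dict-merging semantics such an accumulator needs — a `zero` plus an `addInPlace` that sums per-key counts — can be exercised without a SparkContext. The standalone class and its `addInPlace` below are assumptions for illustration; the original snippet only defines `zero`:

```python
from collections import Counter

class DictMetricParam:
    # Hypothetical standalone analog of UdfMetricAccumulatorParam
    def zero(self, value):
        return {}

    def addInPlace(self, v1, v2):
        # Sum counts per key, e.g. {"rows": 3} + {"rows": 2, "errs": 1}
        merged = Counter(v1)
        merged.update(v2)
        return dict(merged)

param = DictMetricParam()
acc = param.zero({})
for update in ({"rows": 3}, {"rows": 2, "errs": 1}):
    acc = param.addInPlace(acc, update)
print(acc)  # {'rows': 5, 'errs': 1}
```

With Spark, an instance of the real `AccumulatorParam` subclass would be passed to `sc.accumulator(...)` so per-task UDF metrics merge the same way on the driver.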
@pytest.hookimpl(hookwrapper=True)
def pytest_report_teststatus(report, config):
    outcome = yield
    res = outcome.get_result()
    attr_name = "___TIME___"
    if report.when == "setup":
        # HACK: store the start time in `config`
        setattr(config, attr_name, time.time())
    elif report.when == "call":
        # compute the elapsed time from the start time stored above
        elapsed = time.time() - getattr(config, attr_name)
# export SPARK_HOME=<YOUR_SPARK_V3_0>
$ git clone https://github.com/maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen
$ ./bin/datagen --master=local[*] --conf spark.driver.memory=8g --scale-factor 10 --output-location /tmp/tpcds-sf-10
scala> :paste
import org.apache.spark.sql.catalyst.catalog.CatalogColumnStat
import org.apache.spark.sql.execution.datasources.LogicalRelation
import org.apache.spark.sql.types.DataType
sql("SET spark.sql.cbo.enabled=true")
# https://qiita.com/9_ties/items/3bdb177384937ddc88df
# https://homes.cs.washington.edu/~pedrod/papers/mlj05.pdf
import pandas as pd
import numpy as np
from scipy.special import logsumexp
from itertools import product
const = ['A', 'B']
preds = [('Smokes', 1), ('Cancer', 1), ('Friends', 2)] # Predicate and arity
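Given these constants and predicates, the ground atoms of the MLN are every predicate applied to every tuple of constants of matching arity. A short sketch of that enumeration, reusing the variable names from the snippet above:

```python
from itertools import product

const = ['A', 'B']
preds = [('Smokes', 1), ('Cancer', 1), ('Friends', 2)]  # predicate name and arity

# Every predicate applied to every tuple of constants of its arity
ground_atoms = [
    (name,) + args
    for name, arity in preds
    for args in product(const, repeat=arity)
]
print(len(ground_atoms))  # 2 + 2 + 4 = 8 ground atoms
```

Each ground atom is a binary variable of the ground Markov network, so the network here has 2^8 possible worlds.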
///////// Invocation of Scala collection object methods /////////
---
scala> import scala.reflect.runtime.universe._
scala> val mapClazz = scala.collection.immutable.Map.getClass
mapClazz: Class[_ <: scala.collection.immutable.Map.type] = class scala.collection.immutable.Map$
scala> val mirror = runtimeMirror(mapClazz.getClassLoader)
mirror: reflect.runtime.universe.Mirror = JavaMirror with ...