## testman_test

Testman Test

The purpose of this project is to develop a framework for determining the quality of "intelligent personal assistants" (IPAs).

1. Core Concepts

As intelligent personal assistants function by responding to user queries, this project aims to provide a set of queries of various types and their expected responses.

While the Turing Test is a concept for a test that determines whether an artificial intelligence is advanced enough to pass as a human, its implementation is not defined in any way.

This project does not aim to provide an implementation of the Turing Test. It is, however, in some ways a logical predecessor to the Turing Test, as any AI / IPA that can pass the Turing Test should be able to pass the Testman Test.

One of the things we should first debate and then agree on is whether this is a compliance test or a benchmark (does it test IF the functionality works, or HOW WELL it works?).

As there are many types of IPAs, this test should not be limited to the core functionality that most IPAs share, but it should also not try to cover every possible feature.

It should be modular: a base test that covers the functionality expected from every IPA, plus modules with additional tests that can be used to determine the quality of additional functionality.

I currently imagine that queries could be either fixed (read: written in a plaintext list) or generated (the same queries as in the fixed list, but with data selected randomly from a database).
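The fixed/generated distinction can be sketched roughly as follows. This is only an illustration: the names `TEMPLATES`, `DATABASE`, `FIXED_VALUES`, `fixed_queries` and `generated_queries` are hypothetical, and the in-memory dictionary stands in for whatever database the project ends up using.

```python
import random

# Hypothetical query templates; each {slot} is filled from a data source.
TEMPLATES = [
    'Spell "{word}"',
    "Count to {number}",
    "What is the capital of {country}?",
]

# Hypothetical stand-in for the database of selectable values.
DATABASE = {
    "word": ["example", "spelling"],
    "number": [5, 10, 15],
    "country": ["Russia", "France", "Japan"],
}

# Values used by the fixed list (assumed for this sketch).
FIXED_VALUES = {"word": "example", "number": 5, "country": "Russia"}


def fixed_queries():
    """Return the same plaintext queries on every run."""
    return [template.format(**FIXED_VALUES) for template in TEMPLATES]


def generated_queries(rng):
    """Return the same query shapes, with data picked randomly from DATABASE."""
    values = {slot: rng.choice(options) for slot, options in DATABASE.items()}
    return [template.format(**values) for template in TEMPLATES]


if __name__ == "__main__":
    print(fixed_queries())
    print(generated_queries(random.Random()))
```

Passing an explicit `random.Random` instance keeps generated runs reproducible when a seed is supplied, which may matter if results need to be compared.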

2. Versioning

As time progresses, more and more functionality will be added to IPAs. To keep up with these improvements and the growing baseline of functionality, this test should be kept updated and relevant.

With that comes the problem of keeping track of which version of the test was used and which modules were included.

I am not a fan of how Creative Commons licences are named / versioned.

The version name should start with TT (as in Testman Test), followed by the last two digits of the current year (currently 17), then a flag indicating whether the test queries were fixed or generated (F or G), and then some way to indicate which modules were used in the test.
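A minimal sketch of the part of the name that is already decided (prefix, year, and F/G flag) might look like this. The function name `version_prefix` and its parameters are hypothetical, and the module indicator is deliberately left out because its format is still open.

```python
from datetime import date


def version_prefix(generated, today=None):
    """Build 'TT' + two-digit year + 'F' (fixed) or 'G' (generated)."""
    today = today or date.today()
    flag = "G" if generated else "F"
    # year % 100 keeps only the last two digits, zero-padded.
    return f"TT{today.year % 100:02d}{flag}"


if __name__ == "__main__":
    print(version_prefix(generated=False, today=date(2017, 1, 1)))  # TT17F
```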

I would ask for suggestions here. Ideally we would have a numeric representation where a bigger number means more modules were used.

My first idea was to list all recognised modules in a fixed order, mark each with 1 if it was used and 0 if it was not, and then convert this binary sequence into a decimal number.

But that does not work, because the places are not equal: each one represents a value twice as large as the previous one. Therefore a bigger number does not mean more modules.
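The problem is easy to reproduce. The sketch below assumes a hypothetical ordering of the four modules from section 3; the names `MODULES`, `encode` and `module_count` are illustrative only.

```python
# Hypothetical module order; the position decides the weight of each bit.
MODULES = ["diagnostic", "main", "environment", "conversation"]


def encode(used):
    """Mark each module 1 or 0 in list order and read the bits as a binary number."""
    bits = "".join("1" if name in used else "0" for name in MODULES)
    return int(bits, 2)


def module_count(code):
    """Count the 1 bits, i.e. how many modules the encoded value contains."""
    return bin(code).count("1")


one_module = encode({"diagnostic"})                              # bits 1000
three_modules = encode({"main", "environment", "conversation"})  # bits 0111

print(one_module, module_count(one_module))        # 8 1
print(three_modules, module_count(three_modules))  # 7 3
```

Here a single module encodes to 8 while three modules encode to 7, so the decimal value does not order results by module count, even though the count of 1 bits still recovers it.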

Please provide suggestions.

3. Modules

	3.1 Diagnostic module

		Diagnostic queries
			- Name
			- Version
			- enabled modules
			- all available modules
			-

	3.2 Main module

		Basic ability check
			- time
				* What time is it?
				* What's the time?
			- date
				* What's the date?
				* What day is it today?
				*
			- spell
				* Spell "example"
				* Spell "spelling"
			- count
				* Count to 5
				* Count down from 5
				* Count from 10 to 15
			- ...


		Basic questions with static answers.
			- retrieving info from various databases
				* How tall is Big Ben?
				* How long is the Great Wall of China?
				* When was the Eiffel Tower built?
				* What is the capital of Russia?
				* When was Alan Turing born?
			- basic arithmetic
				* 10 + 10
				* 1 - 1
				* 7 * 6
				* 20 / 5
				* 0 + 0 - 0 * 0
				* 1 + 1 - 1 * 1
				* 10 + 1 - 7 * 6 / 5
				* 10 / 0
				* 100000000000000000000 + 100000000000000000000
			- basic concept connections
				* Do dogs have legs?
				* Do cats have legs?
				* Do worms have legs?
				* Can fish swim?
				* Can birds fly?
				* Can trains fly?
				* Do cars have wheels?
				* Do computers need electricity to work?
				* Is a wing part of an airplane?
				* ...
			- basic logic questions
				* true
				* false
				*
			- data type understanding
				* Is 10 a number?
				* Is 10 a word?
				* Is a word made out of letters?
				* ...
			- metric understanding
				* Is a kilometer longer than a centimeter?
				* Is a kilometer longer than a mile?
				* Is a liter more than a gallon?
				* How many bytes are in a kilobyte?
				* How many meters are in a kilometer?
				*
			- ...
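For the basic arithmetic queries above, expected responses could be computed rather than written by hand. The following is a sketch under assumptions: the function name `expected_answer` is hypothetical, exact fractions are used to avoid floating-point rounding, and returning `None` for `10 / 0` is just one way to mark "undefined", since this document does not say how an IPA should answer it.

```python
import ast
import operator
from fractions import Fraction

# Only the four operators that appear in the query list are supported.
OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def expected_answer(expression):
    """Evaluate an integer expression exactly; return None if it is undefined."""

    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
            return OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")

    try:
        # The parser applies the usual precedence: * and / before + and -.
        return walk(ast.parse(expression, mode="eval").body)
    except ZeroDivisionError:
        return None  # assumption: division by zero has no numeric answer


if __name__ == "__main__":
    for query in ["7 * 6", "10 + 1 - 7 * 6 / 5", "10 / 0"]:
        print(query, "=", expected_answer(query))
```

Python integers have arbitrary precision, so the very large addition in the list is handled exactly as well.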

		Basic questions with dynamic answers.
			- basic time relativity
				* What will the date be tomorrow?
				* What time will it be in an hour?
				* How much time until midnight?
				* How many days are in this month?
				* How many days until the end of the year?
			- basic distance relativity
				* How far is the South Pole from here?
				* How far is Mount Everest from here?
				* Are we in London?
				* How far is the nearest restaurant?
			- basic arithmetic relativity
				* What time was it 10 minutes ago?
				*
			- retrieving info about (well known) events
				* When were the last Olympic Games?
				* How long ago was Sputnik launched?
				* Who won the last World Chess Championship?
				* ...
			- ...
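Because the answers to the time-related queries above change with every run, the expected responses have to be derived from a reference time. A possible sketch, with the hypothetical function `expected_time_answers` and the assumption that "end of the year" means 31 December of the current year:

```python
import calendar
from datetime import date, datetime, timedelta


def expected_time_answers(now):
    """Expected answers to the time-relativity queries, relative to `now`."""
    next_midnight = datetime(now.year, now.month, now.day) + timedelta(days=1)
    end_of_year = date(now.year, 12, 31)  # assumption: count up to 31 December
    return {
        "tomorrow": (now + timedelta(days=1)).date(),
        "in_an_hour": now + timedelta(hours=1),
        "until_midnight": next_midnight - now,
        # monthrange returns (weekday of first day, number of days in month).
        "days_in_month": calendar.monthrange(now.year, now.month)[1],
        "days_until_end_of_year": (end_of_year - now.date()).days,
        "ten_minutes_ago": now - timedelta(minutes=10),
    }


if __name__ == "__main__":
    for name, value in expected_time_answers(datetime.now()).items():
        print(name, value)
```

The test harness and the IPA would need to agree on the clock and time zone for these values to be comparable; that is left open here.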

		Illogical questions and statements.
			- False info check
			- Concept connection check
				* How many wheels does a cat have?
				* How many megabytes can a tea cup hold?
				* Why is blue better than bread?

		Paradoxes
			- Logical paradoxes
				* This sentence is false.
				* Does the set of all sets contain itself?
				*

			- Known thought experiments
				*

		Pop culture references
			- Literary references
				* What is the answer to life, the universe and everything?
				*
				* ...
			- Cult movie quotes
			- Responses to cult movie quotes
			- Music lyrics recognition
			- Internet phenomenon

		Gibberish
			- Syntactical gibberish
			- Structural gibberish
			- Random string
			- Random hash
			- Random number
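The last three gibberish types above could be produced with a few lines of code. This is a sketch only: the function names, the default lengths, and the choice of SHA-256 as the hash are assumptions, and syntactical and structural gibberish are not covered because they are not yet defined here.

```python
import hashlib
import random
import string


def random_string(rng, length=12):
    """A string of random ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def random_hash(rng):
    """Hex digest of 16 random bytes (SHA-256 chosen arbitrarily)."""
    data = rng.getrandbits(128).to_bytes(16, "big")
    return hashlib.sha256(data).hexdigest()


def random_number(rng, digits=10):
    """A random non-negative integer with at most `digits` digits."""
    return rng.randrange(10 ** digits)


if __name__ == "__main__":
    rng = random.Random()
    print(random_string(rng))
    print(random_hash(rng))
    print(random_number(rng))
```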

		Misc queries
			-

	3.3 Environment interaction module

		Underlying system interaction

			- Environment recognition
			- Accessing OS functions
			- Accessing filesystem

		Parallel system interaction

			- Interaction with other applications running on same system

		Foreign system interaction

			- Interaction with operating system on a different device
			- Interaction with applications running on a different device


	3.4 Continuous conversation module


4. Misc
