jamii/gist:060b49f02609d0813fcf5f0593339fb8

## gistfile1.txt
mdp example? https://www.nature.com/articles/s41598-017-15249-0
foraging? http://rsif.royalsocietypublishing.org/content/14/136/20170376
something better about policies? https://www.mitpressjournals.org/doi/full/10.1162/neco_a_00999
overview of papers. http://www.sciencedirect.com/science/article/pii/S0149763416307096
working software for reading model. http://www.sciencedirect.com/science/article/pii/S0149763416307096
possible alternative formulations. http://www.mdpi.com/1099-4300/19/6/266/htm
meta-bayes. http://www.ipsi.utoronto.ca/sdis/Friston-Paper.pdf
habits? http://www.sciencedirect.com/science/article/pii/S0149763416301336
hierarchical? http://rstb.royalsocietypublishing.org/content/364/1521/1211?ijkey=22b2ec7c7ee727c55840251984a7e9803922b453&keytype2=tf_ipsecsha

original mdp formulation? https://ac.els-cdn.com/S0149763416301336/1-s2.0-S0149763416301336-main.pdf?_tid=fe1ab906-ea62-11e7-9bff-00000aab0f6c&acdnat=1514309835_bd69f1c3960fe58e9a47e2e071855353
more mdp? mountain car. https://www.researchgate.net/profile/Karl_Friston/publication/230619553_Active_inference_and_agency_Optimal_control_without_cost_functions/links/00b7d527be8cc08a1b000000.pdf
mdp and mountain car again http://www.fil.ion.ucl.ac.uk/~karl/What%20is%20value-accumulated%20reward%20or%20evidence.pdf
derives expected utility? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3782702/
how does approximation mess up - https://pdfs.semanticscholar.org/10e3/d75a448c4f90d22bc423674b4ee7227334c3.pdf

original reading:
https://moodle.ucl.ac.uk/pluginfile.php/4313243/mod_resource/content/5/9_Friston2010.pdf

active inference and bandit?

complex bandit examples, not explicitly same framework - http://proceedings.mlr.press/v38/ortega15.pdf
bounded rationality because expectation maximization too expensive - https://casmodeling.springeropen.com/articles/10.1186/2194-3206-2-2
human behavior in bandit - https://cloudfront.escholarship.org/dist/prd/content/qt5xt5z4tv/qt5xt5z4tv.pdf

equivalent to? special case of? possibly better explanation
http://www.adaptiveagents.org/_media/papers:pathintboundedrationality.pdf
https://arxiv.org/pdf/1107.5766.pdf

relationship between em, reml, variational bayes - http://www.fil.ion.ucl.ac.uk/~wpenny/publications/var_laplace.pdf

---

http://rsif.royalsocietypublishing.org/content/13/122/20160616
actual case study in robot
agent's generative model is not the same as the world's generative model - it bakes in the desired action
no inverse model? optimal control theory requires inverse model? (is this like forward autodiff?) 'because the robot's generative (or forward) model is inverted during the inference'
'some policies cannot be specified using cost functions but can be described using priors'
too complicated to work from

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3096853/
learning only

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003094
bandit example, learning only
doesn't include actions

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0006421
mountain car in detail
policies are functions of sensory input
state entropy bounded by sensory entropy + transfer term
ergodic assumption
sensory entropy = long-run time average of surprise (via the ergodic assumption), i.e. min sensory entropy = min surprise
minimizing surprise in a dynamic environment requires controlling the environment - can't just wait it out
but can't compute surprise directly, so need to bound it (free energy is the upper bound)
'The recognition density is a slightly mysterious construct because it is an arbitrary probability density specified by the internal states of the agent. Its role is to induce free-energy, which is a function of the internal states and sensory inputs.'
policies are functions of sensation and internal states of q
'Note that the true states depend on action, whereas the generative model has no notion of action; it just produces predictions that action tries to fulfil.'
here the generative model fully specifies location, and we pick actions to minimize prediction error of that model
precision is inverse variance of the prior? governs whether discrepancies are resolved by action or perception? (how do we choose precisions?)
give utility function as a desired equilibrium density
> In brief, learning entails immersing an agent in a controlled environment that furnishes the desired equilibrium density. The agent learns the causal structure of this training environment and encodes it through perceptual learning as described above. This learning induces prior expectations that are retained when the agent is replaced in an uncontrolled or test environment. Because the agent samples the environment actively, it will seek out the desired sensory states that it has learned to expect.
so rather than creating priors by hand, we just put the agent in an environment where it makes the right choices. can we learn e.g. regret minimization for bandit?
> The key thing here is that the free-energy principle reduces the problem of learning an optimum policy to the much simpler and well-studied problem of perceptual learning, without reference to action.
^ this seems like a key idea
> Increasing the relative precision of empirical priors on motion causes more confident behaviour, whereas reducing it subverts action, because prior expectations are overwhelmed by sensory input and are therefore not expressed at the level of sensory predictions.
optimal control does some kind of gradient climbing thing, where we have to figure out the value of each state by what end states we can reach from there? hard to solve
free energy allows simply training optimal trajectories and letting the agent figure out the correct policies
optimal control requires knowing hidden states?
acknowledges that they put the solution into the training environment, but still didn't need to learn policy
(what were the parameters that were learned? not clear to me. could be entire trajectory? theta = mu looks like it learns environmental dynamics, mapping from states to senses and states to future states. oh, also params to control polynomial. so learnt a policy that was being directly implemented. not that impressive?)
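
On the note above that surprise can't be computed directly and has to be bounded: a minimal sketch of that bound on a toy discrete model. The model (3 hidden states, 2 observations, the numbers in `prior` and `likelihood`) is made up for illustration and is not from the paper. It only shows the standard identity F(q, s) = -ln p(s) + KL(q || p(x|s)), so F is never below surprise and matches it when q is the exact posterior.

```python
import numpy as np

def surprise(prior, likelihood, s):
    """-ln p(s): what the agent would like to minimize but cannot evaluate in general."""
    return float(-np.log(np.sum(prior * likelihood[:, s])))

def free_energy(q, prior, likelihood, s):
    """F(q, s) = E_q[ln q(x) - ln p(x, s)], computable from q and the generative model."""
    joint = prior * likelihood[:, s]          # p(x, s) for each hidden state x
    return float(np.sum(q * (np.log(q) - np.log(joint))))

def posterior(prior, likelihood, s):
    """Exact p(x | s); tractable here only because the toy model is tiny."""
    joint = prior * likelihood[:, s]
    return joint / joint.sum()

# Hypothetical toy generative model (assumption, not taken from the paper).
prior = np.array([0.5, 0.3, 0.2])             # p(x)
likelihood = np.array([[0.9, 0.1],            # likelihood[x, s] = p(s | x)
                       [0.5, 0.5],
                       [0.2, 0.8]])

if __name__ == "__main__":
    s = 1
    q_guess = np.array([1 / 3, 1 / 3, 1 / 3])  # an arbitrary recognition density
    print("surprise        ", surprise(prior, likelihood, s))
    print("F at guess      ", free_energy(q_guess, prior, likelihood, s))
    print("F at posterior  ", free_energy(posterior(prior, likelihood, s), prior, likelihood, s))
```

The gap between F and surprise is the KL term, so improving q (perception) tightens the bound, while changing s (action) is the only way to lower surprise itself.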

---

precisions?
in the mountain car example, the precisions encode sensory/motor noise levels
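
A minimal sketch of what precision does in the simplest Gaussian case, assuming precision means inverse variance. The function name `fuse` and the numbers are hypothetical; the paper's scheme is more involved. With one Gaussian prior and one Gaussian observation, the posterior mean is the precision-weighted average of the two, so relative precision decides whether the prior expectation or the sensory input wins.

```python
def fuse(prior_mean, prior_precision, obs, obs_precision):
    """Combine a Gaussian prior with a Gaussian observation of the same quantity.

    Precisions are inverse variances; they add, and the posterior mean is the
    precision-weighted average of the prior mean and the observation.
    """
    post_precision = prior_precision + obs_precision
    post_mean = (prior_precision * prior_mean + obs_precision * obs) / post_precision
    return post_mean, post_precision

if __name__ == "__main__":
    # Prior expects 1.0, senses report 0.0 (illustrative values only).
    print(fuse(1.0, 9.0, 0.0, 1.0))   # precise prior: expectation barely moves
    print(fuse(1.0, 1.0, 0.0, 9.0))   # precise senses: expectation is overwhelmed
```

This matches the quoted remark that lowering the relative precision of the priors lets sensory input overwhelm them, leaving little prediction error for action to resolve.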

can we do some picoeconomics?

q is really the desired world, not just an approximation. we act to make the real world look like the desired world.
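
A caricature of "act to make the real world look like the desired world", as a sketch under strong assumptions: a 1-D position, noiseless sensation, action acting directly as velocity, and the agent's expectation held fixed at `target` (no perception step at all). None of this is the paper's mountain car setup, and `run` and its parameters are hypothetical. Action simply descends the gradient of the precision-weighted squared prediction error.

```python
def run(target=1.0, x0=-0.5, precision=4.0, gain=0.5, dt=0.1, steps=200):
    """Drive the true state x toward the state the agent expects to sense."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        s = x                              # sensation: noiseless reading of position
        error = precision * (s - target)   # precision-weighted prediction error
        a = -gain * error                  # descend 0.5 * precision * (s - target)**2
        x = x + dt * a                     # true dynamics: action is velocity (assumption)
        trajectory.append(x)
    return trajectory

if __name__ == "__main__":
    print("confident prior:", run()[-1])                 # ends near target
    print("zero precision: ", run(precision=0.0)[-1])    # agent never moves
```

With zero precision the prediction error carries no weight and nothing happens, which is one way to read the claim that low prior precision subverts action.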