alikins/gist:373bcce9e283bb37bb5e78ce665e519c

## gistfile1.txt


Common problem areas

Ssh

Errors from ‘ssh’ command are obscure
But not meaningless, so they could be processed into more useful ansible error messages
Early versions of https://github.com/ansible/ansible/pull/17598
https://github.com/ansible/ansible/pull/16649

Default verbosity level hides useful troubleshooting info
Ie, no stderr
Even at -vvv or higher, the ssh stderr output is presented in an unreadable way by default callbacks
ControlMaster/ControlPersist also hides/obscures ssh troubleshooting info

No easy way to temporarily disable it
No easy way to collect ssh config info from ansible
   ‘ssh -G’
- Running ‘ssh’ from the cli is not equivalent to how it is invoked from ansible
  - Even when cutting & pasting the ssh command line shown at higher verbosity levels (doesn’t take env vars or user config into account)
     https://github.com/ansible/ansible/pull/23241
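
One way to collect the effective ssh configuration is to run `ssh -G <host>` (OpenSSH 6.8 or later) and parse its `keyword value` output. A minimal sketch; the helper names `parse_ssh_config_dump` and `collect_ssh_config` are hypothetical and not part of ansible:

```python
import subprocess


def parse_ssh_config_dump(text):
    """Parse 'ssh -G' output (one 'keyword value' pair per line) into a dict.

    Every keyword maps to a list, because some keywords (for example
    identityfile) can appear more than once.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        keyword, _, value = line.partition(' ')
        config.setdefault(keyword, []).append(value)
    return config


def collect_ssh_config(host, ssh_executable='ssh'):
    """Run 'ssh -G host' and return the parsed configuration."""
    output = subprocess.check_output([ssh_executable, '-G', host])
    return parse_ssh_config_dump(output.decode('utf-8', 'replace'))
```

This only reflects the user and system ssh config for the host. It does not account for the extra options ansible passes on its own ssh command line.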

- Playbook include unpredictability/complexity
    - No one understands what a ‘static’ include is or why it causes/fixes problems
- Conditionals and handlers on or in includes are very confusing
  - Pretty much no one understands the interactions
- Lots of unexpected include behavior filed as issues
- Undefined var behavior
- Obtuse error messages [1]
- Variable precedence
  - Figuring out why a variable ended up with a particular value is almost impossible
   - Proposal: https://github.com/ansible/ansible/compare/devel...alikins:varman_show_precedence (example output: https://gist.github.com/alikins/405352d8521ce8792fc2f72fde26f9ef)
- YAML errors
  - Current yaml errors are pretty good, but they are often still really vague. And until https://github.com/ansible/ansible/pull/24468 was recently merged, often wrong.
- Jinja2/Templating debugging

- ansible var scope/precedence info [prototyped: varman_show_precedence branch]
  - see var_man_changed_show_change branch
  - add scope/scope_info to playbook Base (Block?)
    - scope label ('role_vars', 'extra_vars', etc)
    - scope info
        - task/role/etc name
        - file path of file var came from ('group_vars/all', 'role/myrole/defaults/main.yml', etc)
  - various 'get a dict of vars from some source' bits called in vars.manager.Manager.get_vars() could
    return the info
    - as atts on the object?
    - as internal/magic dict items?  ('_ansible_scope_label', etc)
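
The scope label / scope info idea above could look roughly like this. It is a sketch only: `merge_with_scope` is a hypothetical helper, and the real precedence logic in ansible's variable manager is far more involved.

```python
def merge_with_scope(layers):
    """Merge variable layers and remember where each final value came from.

    'layers' is an iterable of (scope_label, scope_info, vars_dict) tuples,
    ordered from lowest to highest precedence. Returns a dict mapping each
    variable name to its value, the winning scope, and the scopes it overrode.
    """
    merged = {}
    for label, info, variables in layers:
        for name, value in variables.items():
            previous = merged.get(name)
            overrode = previous['overrode'] + [previous['scope']] if previous else []
            merged[name] = {
                'value': value,
                'scope': label,      # e.g. 'role_vars', 'extra_vars'
                'info': info,        # e.g. file path the var came from
                'overrode': overrode,
            }
    return merged
```

For example, a `port` defined in `role/myrole/defaults/main.yml`, again in `group_vars/all`, and again as an extra var would report `extra_vars` as its scope and list the two lower scopes it overrode.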


- vault extensibility

  - further split envelope/wrappers out
  - plugins for vault secrets
  - plugins for vault ciphers
  - figure out what we need to enable PKI based tools (ie, gpg)
  - make password rounds configurable
    - likely needs to add the number of rounds to envelope format

   - vault edit inline vault
      - parse yaml
      - decrypt !vault-encrypted
      - replace with !vault-plaintext
      - parse saved yaml
      - encrypt !vault-plaintext
      - save with encrypted values
        - can't serialize to yaml
        - need to do text munging
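
The "text munging" step could be sketched as a line-oriented rewrite. Assumptions: the `!vault-plaintext` tag from the notes above sits on a single line of a plain mapping entry (not a list item), and `encrypt` is a caller-supplied stand-in, not the real vault API.

```python
import re

# Matches 'key: !vault-plaintext value' on one line (an assumed layout).
PLAINTEXT_RE = re.compile(
    r'^(?P<indent>\s*)(?P<key>[^\s:#][^:]*):\s*!vault-plaintext\s+(?P<value>.*)$'
)


def reencrypt_text(yaml_text, encrypt):
    """Replace each !vault-plaintext entry with a '!vault |' block scalar.

    Works on the text directly, so comments, ordering, and formatting of
    every other line are left exactly as they were.
    """
    out = []
    for line in yaml_text.splitlines():
        match = PLAINTEXT_RE.match(line)
        if match is None:
            out.append(line)
            continue
        indent = match.group('indent')
        out.append('%s%s: !vault |' % (indent, match.group('key')))
        # Indent the encrypted payload two spaces deeper than the key.
        for enc_line in encrypt(match.group('value')).splitlines():
            out.append('%s  %s' % (indent, enc_line))
    return '\n'.join(out) + '\n'
```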

- better serialization/dump/yaml of playbook related objects

    Goal is to make the playbook objects easier to debug and troubleshoot for users and developers.

    # easier, more useful parts

    - for dumping parsed/compiled Playbook object as it exists before execution
    - start with leaf nodes (FieldAttributes)
    - get unsafe canonical yaml working
    - get safe canonical yaml working
    - get safe non-canonical yaml working (ie, output that looks like a playbook)
    - proceed up the tree
    - make sure container/list type objects know
      to serialize their contents
    - add debug/troubleshooting hooks for displaying/presenting/persisting this info
      to users. --dump or whatever

    # shouldn't be that hard actually
    - start trying to dump 'mid execution' Playbook
     - ie, latest values of vars
      - any new vars
      - any new blocks/tasks/plays/roles

    # harder, but opens up arch even more
    - after everything is safe and non-canonical yaml, maybe repeat with:
      - json
      - pickle
      - repr
      - maybe str
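
The "start with leaf nodes, then proceed up the tree" plan could be sketched as below. `Field` and `Node` are hypothetical stand-ins for FieldAttributes and playbook objects. The idea is that each object reduces itself to plain dicts and lists, which any safe dumper (json here, yaml later) can then write out.

```python
import json


class Field(object):
    """Stand-in for a leaf value held by a field attribute."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def serialize(self):
        return self.value


class Node(object):
    """Stand-in for a container such as a task, block, or play."""

    def __init__(self, kind, fields=(), children=()):
        self.kind = kind
        self.fields = list(fields)
        self.children = list(children)

    def serialize(self):
        # Leaves first, then let each child serialize its own contents.
        data = {field.name: field.serialize() for field in self.fields}
        if self.children:
            data['children'] = [child.serialize() for child in self.children]
        return {self.kind: data}


def dump(node):
    """Render a node tree as JSON text built only from plain data types."""
    return json.dumps(node.serialize(), indent=2, sort_keys=True)
```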


- logging
  - I can dream.
    - at least setup a logger correctly
    - and use debug()/exception()
    - playbook, inv, host, group, etc serialize/deserialize support
    - gh bugs for display log
        - logger name
           - not standard
           - not useful/hard to predict/misused
         - not using %(process)s
         - log file level tied to cli/display level
         - doesn't create logger unless using log file
            - nothing else can attach a handler to the display log
         - only uses two log levels
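
For reference, "set up a logger correctly" with the stdlib might look like this sketch. The logger name `ansible` and the child name `ansible.playbook.example` are assumptions chosen for illustration:

```python
import logging

# Include the process id so output from forked workers can be told apart.
LOG_FORMAT = '%(asctime)s %(process)d %(name)s %(levelname)s %(message)s'


def setup_logging(level=logging.INFO, stream=None):
    """Configure the top-level 'ansible' logger.

    The level is set here, independent of any cli/display verbosity, and the
    logger exists whether or not a log file is in use, so other code can
    attach its own handlers.
    """
    logger = logging.getLogger('ansible')
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger


# Library code only asks for a child logger by name and never adds handlers.
log = logging.getLogger('ansible.playbook.example')


def load_something(path):
    log.debug('loading %s', path)
    try:
        raise ValueError('bad data in %s' % path)
    except ValueError:
        # exception() logs at ERROR level and appends the traceback.
        log.exception('failed to load %s', path)
```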

- ansible callbacks
    - freeze current callback api

    - add a new one with better versioning/introspection
    - make sure new interface is more clearly display/progress callbacks
      and not api hooks or entry points (ie, internals are 'ro')

    - split single callback interface into smaller composable
      parts
      - Task callbacks
      - Play callbacks
      - Playbook callbacks
      - Handler callbacks (see 'ansible handlers')
      - ansible process/instance/run lifetime callbacks
        - app startup
        - inventory load
        - etc

    - MAYBE: try adding 'rw' hook/slot/callback API entrypoints
      - yum plugin-ish
        - not a great example of maintainable interfaces, but it
          is/was a very flexible/powerful approach
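
A rough sketch of the "smaller composable parts" idea. All class and method names here are hypothetical, not an existing ansible API. Plugins subclass only the interfaces they care about, and the dispatcher hands them deep copies so internals stay 'ro':

```python
import copy


class TaskCallbacks(object):
    """Task-level events; plugins override only what they need."""

    def on_task_start(self, task_name):
        pass

    def on_task_ok(self, task_name, result):
        pass


class PlayCallbacks(object):
    """Play-level events."""

    def on_play_start(self, play_name):
        pass


class CallbackDispatcher(object):
    def __init__(self):
        self._plugins = []

    def register(self, plugin):
        self._plugins.append(plugin)

    def emit(self, interface, event, *args):
        """Send an event to every plugin implementing 'interface'.

        Arguments are deep-copied, so a plugin cannot change engine state.
        """
        for plugin in self._plugins:
            if isinstance(plugin, interface):
                getattr(plugin, event)(*copy.deepcopy(args))
```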

- make DataLoader pluggable
    - split into DataReader and DataDeserializer
       - FileReader, VaultFileReader
       - yaml/json
       - reuse/share some/more inventory code?
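
The reader/deserializer split might be sketched like this. The class shapes are assumptions; `VaultFileReader` takes a stand-in `decrypt` callable rather than the real vault code, and only a JSON deserializer is shown to keep the example stdlib-only.

```python
import json


class FileReader(object):
    """Reads raw bytes from a path."""

    def read(self, path):
        with open(path, 'rb') as handle:
            return handle.read()


class VaultFileReader(object):
    """Wraps another reader and decrypts whatever it returns."""

    def __init__(self, inner, decrypt):
        self.inner = inner
        self.decrypt = decrypt

    def read(self, path):
        return self.decrypt(self.inner.read(path))


class JsonDeserializer(object):
    """Turns raw bytes into data structures."""

    def loads(self, data):
        return json.loads(data.decode('utf-8'))


class DataLoader(object):
    """Composes one reader with one deserializer."""

    def __init__(self, reader, deserializer):
        self.reader = reader
        self.deserializer = deserializer

    def load(self, path):
        return self.deserializer.loads(self.reader.read(path))
```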

- ansible payload proposal
  - make module_common and the bits of executor/ that build ansiballz a separate cli
    - or at least decouple the code
  - build modules based on target platform/runtime/versions
    - ie, python-posix-ansible-2.4 or powershell-windows-ansible-2.4 or golang-posix-ansible-2.5
  - make it easier to do things like add PEX support

- move base.Base._post_validate logic into FieldAttributes subclasses?


- better serialization/dump/yaml of playbook/
    # continue from above
    # getting pretty darn hard
    - then maybe try supporting dumping back to 'original' playbook form
      - means tracking extra info
      - dir/filenames of includes

    # very unlikely
    - then maybe try supporting dumping back to 'original, pre templating' playbook form
      - means tracking the source of vars and parent templates
      - means serializing jinja template objects, if that is a thing


- ansible handlers
  - per block handlers
  - fail/error/changed/skipped handlers
  - per host handlers
    - or pass some 'user data' obj/ref to handlers with
      extra info (like the host name, or error info, etc)
  - support a generic handler that matches all notifies
    - mostly for debugging
  - implicit handlers
    - pre/post task
    - fail/changed/skipped mentioned above
    - task would always notify/emit 'task done handler' etc
      - default would be no handlers
    - add handler specific hooks to callback plugins
      - v3_on_handler_called
      - v3_on_handler_ok
      - v3_on_handler_error


    # this is gobject or DOM style property notifications. Non trivial, but super useful.
    - let varmanager emit handler notifies
      - tasks/plays/roles/etc using a set of vars could
        set 'listen' for varmanager change notifies
        - ie, like GObject 'properties' and prop change signals
          or web browser DOM 'mutationObservers'
        - set_fact: blip='foobar'
           - would 'notify' a 'facts_blip.changed' handler
           - if there is a handler listening for 'facts_blip.changed', it would
             get notified and run at next appropriate time (idle loop-ish)
           - if handlers are per block/task/play/playbook/role, then each could have
             a handler listening for 'facts_blip.changed'
             - block could ignore it and let it propagate
             - play would catch it, handle it (say, restart a service for classic example) and
               stop propagating it
             - if play doesn't handle it, propagate to playbook
             - ... then onto global
             - ... then onto universal persistent handler? (ie, tower etc)
      - handling changing vars event driven would allow for setting/changing global
        semi-immutable vars (like inventory)
        - ie, queue var change, idle loop, pop it, change it, queue 'changed' signal
          - then next (concurrent-ish) var change is queued, idle loop, popped, changed, and 'changed' is emitted
          - any block with (implicit, default) change handlers would handle changed signals before
            using their local var closure
  - possible impls?
    - strategy checks for task result _ansible_notify
    - task executor sets _ansible_notify from Task 'notify' field attribute
    - strategy only does handlers on success and on 'changed'
      - extract the handler running code to method (deep in strategy _process_pending_results)
        - amongst other things, this is also where 'handler hierarchy resolution' is handled (ie, role or play or global)
      - param the result field for handler names ('_ansible_notify')
         - each task result stanza could check for its handlers. ie, failed would _get_handlers(name='_ansible_failed_notify')
         - handle ok/failed/skipped/unreachable * changed/notchanged
      - add_host/add_group/diff etc as internal implicit handlers?
    - need a Role like HandlerDef to have a ds for handler/listen args ie
      - notify:
          - some_task:
              src: foo
              dest: /bar
            register: some_task_result
            notify:
              - my_blip_handler:
                  host: the_other_machine
                  result: some_task_result
              - restart_a_service_or_whatever:
                   svc: httpd
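
The block -> play -> playbook -> global propagation described above could be sketched as follows. `Scope` is a hypothetical class, and this version runs handlers immediately rather than queueing them for an idle loop:

```python
class Scope(object):
    """One level in the block -> play -> playbook -> global chain."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.listeners = {}

    def listen(self, event, handler):
        self.listeners.setdefault(event, []).append(handler)

    def notify(self, event, payload=None):
        """Deliver 'event' to the nearest scope that listens for it.

        A scope with no listener lets the event propagate to its parent.
        The first scope that handles it stops propagation. Returns the name
        of the handling scope, or None if nothing was listening.
        """
        scope = self
        while scope is not None:
            handlers = scope.listeners.get(event)
            if handlers:
                for handler in handlers:
                    handler(payload)
                return scope.name
            scope = scope.parent
        return None
```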

- ansible update/partial results
  - see 'update_json' for one approach
  - would be nice to have more connection channels for 'out of band' control/updates/partial results:
    - would like to avoid:
        - multiplexing multiple 'channels' to just stdout/stderr
        - having to do locking around output streams to avoid corrupt messages
        - having to do any sort of 'escape' from stdout stream
          - for ex, if random module writes out the same format as proposed json updates
        - having to do any additional parsing of stdout
          - better would be to be able to get rid of some filter_non_json kind of things
  - related: see module_log branch for returning log records as json
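
To illustrate a separate channel for partial results, here is a local-only sketch using an extra pipe passed to a child process (POSIX only). It is not how ansible connection plugins work; the child script and the `progress` message format are made up for the example.

```python
import json
import os
import subprocess
import sys

# Child writes JSON-lines updates to a dedicated fd, final result to stdout.
CHILD = r'''
import json, os, sys
with os.fdopen(int(sys.argv[1]), "w") as updates:
    for pct in (50, 100):
        updates.write(json.dumps({"progress": pct}) + "\n")
print(json.dumps({"changed": False}))
'''


def run_with_update_channel():
    """Run the child and return (partial_updates, final_result)."""
    read_fd, write_fd = os.pipe()
    proc = subprocess.Popen(
        [sys.executable, '-c', CHILD, str(write_fd)],
        stdout=subprocess.PIPE,
        pass_fds=(write_fd,),  # keep the write end open in the child
    )
    os.close(write_fd)  # parent only reads; lets EOF arrive when child closes
    with os.fdopen(read_fd) as updates:
        partial = [json.loads(line) for line in updates]
    stdout, _ = proc.communicate()
    return partial, json.loads(stdout.decode('utf-8'))
```

Because updates never touch stdout, the final result needs no escaping, locking, or filtering of non-JSON text.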


Troubleshooting / Support tools

- Better logging support
  - Ansible core does not really use logging. There are some bits of display that can also log to a log file, but it has a lot of problems
- install/env collection tools
  - Ala ‘sosreport’ or similar tools
  - Collect
    - where/how ansible is installed
    - Python modules used
    - Configuration
    - Env
    - Info about external tools used
      - Ssh
        - Local and remote config
        - Logs if possible
      - Sudo/su etc
      - Shell type/version
    - Ansible related system logging
  - Could be playbook/role based
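
A starting point for such a collection tool, as a stdlib-only sketch. The function name and the set of fields are assumptions; a real tool would gather much more, such as config files and logs.

```python
import os
import platform
import shutil
import sys


def collect_environment(environ=None):
    """Gather basic facts about the local ansible/python environment.

    'environ' defaults to os.environ; passing a dict makes this testable.
    Executable lookups always use the real PATH of the current process.
    """
    environ = os.environ if environ is None else environ
    return {
        'python_executable': sys.executable,
        'python_version': platform.python_version(),
        'platform': platform.platform(),
        'ansible_path': shutil.which('ansible'),  # None if not on PATH
        'ssh_path': shutil.which('ssh'),
        'shell': environ.get('SHELL'),
        'ansible_env': {
            key: value for key, value in environ.items()
            if key.startswith('ANSIBLE_')
        },
    }
```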


End notes

1. fatal: [testhost]: FAILED! => {
	"failed": true,
	"msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined.The error was: 'test' is undefined\n\nThe error appears to have been in '/root/ansible/test/integration/targets/any_errors_fatal/test_fatal.yml': line 7, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- shell: \"echo {{ test }}\"\n  ^ here\nWe could be wrong, but this one looks like it might be an issue with\nmissing quotes.  Always quote template expression brackets when they\nstart a value. For instance:\n\n	with_items:\n  	- {{ foo }}\n\nShould be written as:\n\n	with_items:\n  	- \"{{ foo }}\"\n\nexception type: <class 'ansible.errors.AnsibleUndefinedVariable'>\nexception: 'test' is undefined"
}.   Huh?

