devarsht/vpu_poll_range_diff_v2

## vpu_poll_range_diff_v2
1:  9dff0110f466 ! 1:  10eda94f7fba media: chips-media: wave5: Add hrtimer based polling support
    @@ Metadata
      ## Commit message ##
         media: chips-media: wave5: Add hrtimer based polling support

    -    Add support for starting a polling timer in case interrupt is not
    -    available. This helps keep the VPU functional in SoC's such as AM62A, where
    -    the hardware interrupt hookup may not be present due to an SoC errata [1].
    +    Add support for starting a polling timer in case an interrupt is not
    +    available. This helps to keep the VPU functional in SoCs such as AM62A,
    +    where the hardware interrupt hookup may not be present due to an SoC errata
    +    [1].

    -    The timer is shared across all instances of encoder and decoder and is
    -    started when first instance of encoder or decoder is opened and stopped
    -    when last instance is closed, thus avoiding per instance polling and saving
    -    CPU bandwidth.
    +    The timer is shared across all instances of encoders and decoders and is
    +    started when the first instance of an encoder or decoder is opened and
    +    stopped when the last instance is closed, thus avoiding per instance
    +    polling and saving CPU bandwidth. As VPU driver manages this instance
    +    related tracking and synchronization, the aforementioned shared timer
    +    related polling logic is implemented within the VPU driver itself. This
    +    scheme may also be useful in general too (even if irq is present) for
    +    non-realtime multi-instance VPU use-cases (for e.g 32 instances of VPU
    +    being run together) where system is running already under high interrupt
    +    load and switching to polling may help mitigate this as the polling thread
    +    is shared across all the VPU instances.

    -    hrtimer callback is called with 5ms polling interval while any of the
    -    encoder/decoder instances are running to check the interrupt status as
    -    being done in irq handler.
    +    Hrtimer is chosen for polling here as it provides precise timing and
    +    scheduling and the API seems better suited for periodic polling task such
    +    as this.  As a general rule of thumb,

    -    Based on above interrupt status, use a worker thread to iterate over the
    -    interrupt status for each instance and send completion event as being done
    -    in irq thread function.
    +    Worst case latency with hrtimer = Actual latency (achievable with irq)
    +                                      + Polling interval
    +
    +    NOTE (the meaning of terms used above is as follows):
    +    - Latency: Time taken to process one frame
    +    - Actual Latency : Time taken by hardware to process one frame and signal
    +      it to OS (i.e. if latency that was possible to achieve if irq line was
    +    present)
    +
    +    There is a trade-off between latency and CPU usage when deciding the value
    +    for polling interval. With aggressive polling intervals (i.e. going with
    +    even lesser values) the CPU usage increases although worst case latencies
    +    get better. On the contrary, with greater polling intervals worst case
    +    latencies will increase although the CPU usage will decrease.
    +
    +    The 5ms offered a good balance between the two as we were able to reach
    +    close to actual latencies (as achievable with irq) without incurring too
    +    much of CPU as seen in below experiments and thus 5ms is chosen as default
    +    polling interval.

    -    Parse for irq number before v4l2 device registration and if not available
    -    only then, initialize hrtimer and worker thread.
    +    - 1x 640x480@25 Encoding using different hrtimer polling intervals [2]
    +    - 4x 1080p30 Transcode (File->decode->encode->file) irq vs polling
    +      comparison [3]
    +    - 1x 1080p Transcode (File->decode->encode->file) irq vs polling comparison
    +      [4]
    +    - 1080p60 Streaming use-case irq vs polling comparison [5]
    +    - 1x 1080p30 sanity decode and encode tests [6]

    -    Move the core functionality of irq thread function to a separate function
    -    wave5_vpu_handle_irq so that it can be used by both the worker thread when
    -    using polling mode and irq thread when using interrupt mode.
    +    The polling interval can also be changed using vpu_poll_interval module
    +    param in case user want to change it as per their use-case requirement
    +    keeping in mind above trade-off.

    -    Protect hrtimer access and instance list with device specific mutex locks
    -    to avoid race conditions while different instances of encoder and decoder
    -    are started together.
    +    Based on interrupt status, we use a worker thread to iterate over the
    +    interrupt status for each instance and send completion event as being done
    +    in irq thread function.
    +
    +    Move the core functionality of the irq thread function to a separate
    +    function wave5_vpu_handle_irq so that it can be used by both the worker
    +    thread when using polling mode and irq thread when using interrupt mode.

    -    Add module param to change polling interval for debug purpose.
    +    Protect the hrtimer access and instance list with device specific mutex
    +    locks to avoid race conditions while different instances of encoder and
    +    decoder are started together.

         [1] https://www.ti.com/lit/pdf/spruj16
         (Ref: Section 4.2.3.3 Resets, Interrupts, and Clocks)
    +    [2] https://gist.github.com/devarsht/ee9664d3403d1212ef477a027b71896c
    +    [3] https://gist.github.com/devarsht/3a58b4f201430dfc61697c7e224e74c2
    +    [4] https://gist.github.com/devarsht/a6480f1f2cbdf8dd694d698309d81fb0
    +    [5] https://gist.github.com/devarsht/44aaa4322454e85e01a8d65ac47c5edb
    +    [6] https://gist.github.com/devarsht/2f956bcc6152dba728ce08cebdcebe1d

         Signed-off-by: Devarsh Thakkar <devarsht@ti.com>
         Tested-by: Jackson Lee <jackson.lee@chipsnmedia.com>
    +    ---
    +    V2:
    +    - Update commit message as suggested in review to give more context
    +      on design being chosen and analysis that was done to decide on same
    +    - Add Tested-By
    +
    +    Range diff w.r.t v1 :
    +    https://gist.github.com/devarsht/cd6bbb4ba90b0229be4718b7140ef924

      ## drivers/media/platform/chips-media/wave5/wave5-helper.c ##
     @@ drivers/media/platform/chips-media/wave5/wave5-helper.c: int wave5_vpu_release_device(struct file *filp,
    @@ drivers/media/platform/chips-media/wave5/wave5-helper.c: int wave5_vpu_release_d
      {
      	struct vpu_instance *inst = wave5_to_vpu_inst(filp->private_data);
     +	struct vpu_device *dev = inst->dev;
    -+	int ret = 0;
    ++	int ret;

      	v4l2_m2m_ctx_release(inst->v4l2_fh.m2m_ctx);
      	if (inst->state != VPU_INST_STATE_NONE) {
    + 		u32 fail_res;
    +-		int ret;
    +
    + 		ret = close_func(inst, &fail_res);
    + 		if (fail_res == WAVE5_SYSERR_VPU_STILL_RUNNING) {
     @@ drivers/media/platform/chips-media/wave5/wave5-helper.c: int wave5_vpu_release_device(struct file *filp,
      	}