brabect1/coding_rtl_for_phys_impl.rst

RTL Coding Tips for Easier Physical Implementation

Quite a few digital design engineers limit their activities to architecture design and RTL coding.
The missing first-hand experience with later design phases then shows in how much their
code and architecture complicate physical implementation. Areas where this surfaces most often
are clock/reset schemes, clock domain crossings (CDC) and scan/DFT aspects.

Gate Instantiations

To instantiate library cells (e.g. multiplexers, ANDs, etc.) that have RTL syntax equivalents,
or not: that is the question. Independence of the target technology is one of the reasons RTL
exists. On the other hand, certain physical implementation aspects (e.g. timing constraint
definition) get easier when gates are manually instantiated.
A typical example is a clock multiplexer (MUX). In Verilog, it can be easily coded as:
assign mux_clk = sel ? b_clk : a_clk;

In SDC, you may need to define generated clocks. Something like:
create_clock a_clk -name A_CLK -period ...
create_clock b_clk -name B_CLK -period ...

# -add keeps MUX_CLK_A in place when the second clock is defined on the same pin
create_generated_clock mux_inst/Q -name MUX_CLK_A -master A_CLK -source mux_inst/A -combinational
create_generated_clock mux_inst/Q -name MUX_CLK_B -master B_CLK -source mux_inst/B -combinational -add

So the first problem becomes the multiplexer gate name. In a netlist synthesized from RTL, combinational
gate instances are named arbitrarily (unlike e.g. sequential gates), such as g123, and the name may
change with a new synthesis run. Gate pin names differ between cell libraries too, and a reusable
SDC would need to be proofed against such changes. Hence the working SDC constraints change to something like:
# The assumption here is that the synthesis tool honors the net name coming out
# of the mux (i.e. `mux_clk`), which is usually so. Otherwise the SDC would have
# no anchor to identify the gate!

set mux_net [get_nets -hier mux_clk];
set mux_out [get_pins -quiet -leaf -of ${mux_net} -filter "pin_direction == out"];
set mux_in_a [...];
set mux_in_b [...];
# Use the looked-up pins as clock sources; -add allows two clocks on the same output pin.
create_generated_clock ${mux_out} -name MUX_CLK_A -master A_CLK -source ${mux_in_a} -combinational
create_generated_clock ${mux_out} -name MUX_CLK_B -master B_CLK -source ${mux_in_b} -combinational -add

The other problem is that the synthesis tool is free to choose any gates that fit the
combinational function; in particular, multiplexers often get synthesized into AND-OR-NOT
gates despite the library having a specialized MUX gate. Getting the actual mux_in_a
and mux_in_b pins in a gate-type independent way then becomes a grand exercise in SDC coding.
On top of that, consider the varying support and differences among SDC interpreters/tools.
Altogether, a very simple case (for an RTL designer) becomes a drag for a physical implementation
engineer. Rather than spending time on complex and fragile "generic" SDC code, it is far better
to put the target library cell into RTL:
mux2_4 mux_inst( .A(a_clk), .B(b_clk), .Q(mux_clk), .S(sel) );

If RTL reuse and independence are important (e.g. reusable soft IPs), then the option is to use
cell wrappers:
module my_ip( ... );
  ...
  mux2_wrap mux_inst( .A(a_clk), ... );
  ...
endmodule

// generic mux wrapper that can be distributed with the soft IP
module mux2_wrap ( input logic A, input logic B, input logic S, output logic Q);
  assign Q = S ? B : A;
endmodule

// `mylib` library cell wrapper (replaces the generic wrapper when synthesizing
// into `mylib` gates). Notice the wrapper also does pin translation so that
// SDC will always refer to the wrapper's hierarchical instance pins that remain
// the same.
module mux2_wrap( ... );
    mymux2_4 mux_inst( .D0(A), .D1(B), .Y(Q), .SEL(S) );
endmodule
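
With wrappers in place, the clock mux constraints from the beginning of this section could be written against the wrapper pins. The fragment below is only a sketch under assumptions the text does not state: that a_clk and b_clk are ports of the constrained design, that the mux_inst wrapper hierarchy is preserved through synthesis, and that the tool accepts generated clocks on hierarchical pins. The periods are placeholder values.

```tcl
# Assumed: a_clk/b_clk are top-level ports; periods are placeholders.
create_clock -name A_CLK -period 10.0 [get_ports a_clk]
create_clock -name B_CLK -period 12.5 [get_ports b_clk]

# Wrapper pins A, B and Q keep their names whichever library cell sits inside
# the wrapper, so these lines need no change when the target library changes.
create_generated_clock -name MUX_CLK_A -master_clock A_CLK \
    -source [get_pins mux_inst/A] -combinational [get_pins mux_inst/Q]

# -add is needed because a second clock is defined on the same pin.
create_generated_clock -name MUX_CLK_B -master_clock B_CLK \
    -source [get_pins mux_inst/B] -combinational -add [get_pins mux_inst/Q]
```

Neither the arbitrary gate name nor the library-specific pin names (D0, D1, Y in the `mylib` wrapper) appear in the constraints.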

Gate wrappers can themselves become a "soft IP" so that your regular soft IPs use
consistent wrapper naming and you develop the gate wrappers only once for each target library.
There are other occasions besides MUXes where gate instantiations become helpful. In general,
all such occasions are characterized by a need to identify the gate instance by name, be it
in intent files (SDC, UPF, ...) or EDA scripts for constraining and/or design manipulation.
Examples include various clock/reset/DFT/... gating gates, don't touch cells (e.g. input/output
buffers), symmetric clock tree gates.

Pragmas (and Multiplexers)


Clock Dividers

A clock divider is what the name indicates: it produces a new clock (of lower frequency) by dividing
an existing clock by a fixed, pre-defined factor. Consider a divider by two:
module div2( input logic i, output logic o, input logic rst );
  always @(posedge i or posedge rst) begin
    if (rst) o <= 1'b0;
    else o <= ~o;
  end
endmodule

module my_design( input logic clk, ... );
  logic clk_div2;
  div2 div2_inst( .i(clk), .o(clk_div2), .rst(...) );
  ...
endmodule


Note
Cascading div2 instances yields an efficient (from power/performance/area, PPA, perspective)
div-by-2^n divider. See STA Constraints of Asynchronous Counters
for specific timing constraints considerations for larger n's.

Non-2^n dividers get a bit more complex. Yet they all share the fact that the divided clock is
a flop output (for glitch prevention):
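module div3( input logic i, output logic o, input logic rst );
  logic[1:0] cnt;

  // cnt cycles through 0, 1, 2. `o` is high for one out of every three input
  // cycles, so its period is three input periods (duty cycle 1/3, not 50%).
  // Toggling `o` at cnt == 2 instead would divide by six.
  always @(posedge i or posedge rst) begin
    if (rst) begin
      o <= 1'b0;
      cnt <= '0;
    end
    else if (cnt == 2'd2) begin
      o <= 1'b1;
      cnt <= '0;
    end
    else begin
      o <= 1'b0;
      cnt <= cnt + 1'b1;
    end
  end
endmodule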

From the SDC perspective, every divided clock becomes a generated clock and needs to be constrained as such:
create_clock clk -name CLK -period ...
create_generated_clock div3_inst/o_reg/Q -name clk_div3 -master CLK -source div3_inst/o_reg/CK -divide_by 3

That is not too bad, is it? Well, generated clocks do not inherit master clock properties (e.g. clock
uncertainty), so you need to replicate those manually. Likewise, any CDC exceptions that you specify for
the master clock would have to be specified for the generated clock too.
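
A sketch of what such replication might look like follows. Only the CLK and clk_div3 clock names come from the constraints above; the uncertainty values and the unrelated clock OTHER_CLK are hypothetical, and whether a particular tool requires the generated clock to be listed explicitly should be checked in its documentation.

```tcl
# Hypothetical uncertainty values for the master clock.
set_clock_uncertainty -setup 0.10 [get_clocks CLK]
set_clock_uncertainty -hold  0.05 [get_clocks CLK]

# Repeated for the generated clock, which does not pick them up from CLK.
set_clock_uncertainty -setup 0.10 [get_clocks clk_div3]
set_clock_uncertainty -hold  0.05 [get_clocks clk_div3]

# CDC exception against an assumed unrelated clock OTHER_CLK: the generated
# clock is listed in the same group as its master.
set_clock_groups -asynchronous -group {CLK clk_div3} -group {OTHER_CLK}
```

Every additional divided clock adds another set of such lines, which is part of the maintenance cost of the divider approach.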
Quite often, though, what RTL designers/architects really need is to have flops trigger once in a while.
For that case, clock gating is superior to the clock divider. For one reason, it is transparent to STA
and requires no explicit SDC constraints. Compare the two code examples below. The main difference is the
generated clock shape/duty cycle, which is often no problem. There will be a difference in STA, though; you
need no extra SDC constraints for clk_gated, yet STA would time all its flops with the clk period (rather
than with the -divide_by 3 period), which may challenge the timing of too slow/complex logic.
module div_example( input logic clk, ...);
  logic[1:0] cnt;
  logic clk_div3;

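  // Clocked by `clk` (the module has no `i` port). clk_div3 is high for one
  // out of every three clk cycles, so its period is three clk periods.
  always @(posedge clk or posedge rst) begin
    if (rst) begin
      clk_div3 <= 1'b0;
      cnt <= '0;
    end
    else if (cnt == 2'd2) begin
      clk_div3 <= 1'b1;
      cnt <= '0;
    end
    else begin
      clk_div3 <= 1'b0;
      cnt <= cnt + 1'b1;
    end
  end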

  logic data;
  always @(posedge clk_div3 or posedge rst) begin
    if (rst) data <= '0;
    else data <= ...;
  end
endmodule

module cgc_example( input logic clk, ...);
  logic[1:0] cnt;
  logic clk_en;
  logic clk_gated;

  assign clk_en = (cnt == 2'd2);

  // Here we use AND-gating for simplicity, while in most designs
  // you would choose a latch-based Integrated Clock Gating (ICG) cell.
  assign clk_gated = clk_en & clk;

  always @(posedge clk or posedge rst) begin
    if (rst)         cnt <= '0;
    else if (clk_en) cnt <= '0;
    else             cnt <= cnt + 1'b1;
  end

  logic data;
  always @(posedge clk_gated or posedge rst) begin
    if (rst) data <= '0;
    else data <= ...;
  end
endmodule

Note that FPGA designers would likely choose yet another coding style that, with proper
synthesis settings (i.e. clock gate inferring), is equivalent to cgc_example. Hence
FPGA re-targeting/prototyping may also be a reason for choosing one architecture/coding style
over the other. As with cgc_example, this RTL code requires no extra SDC code.
module en_example( input logic clk, ...);
  logic[1:0] cnt;
  logic clk_en;

  assign clk_en = (cnt == 2'd2);

  always @(posedge clk or posedge rst) begin
    if (rst)         cnt <= '0;
    else if (clk_en) cnt <= '0;
    else             cnt <= cnt + 1'b1;
  end

  logic data;
  always @(posedge clk or posedge rst) begin
    if (rst) data <= '0;
    else if (clk_en) data <= ...;
  end
endmodule


Conclusions

RTL coding practices greatly impact the complexity of SDC constraints and other physical implementation
aspects. Here are some recommendations that make physical implementation easier:

- Prefer gate instantiations for:

  - All cells in the clock tree (e.g. clock muxes, AND/OR gates, ICG cells).
  - All multiplexers that yield a distinct operation mode (and hence a distinct SDC).

- Consider library cell wrappers to isolate RTL changes from target library changes.
- Avoid pragmas.
- Wherever possible, replace clock dividers with clock gating/enables.