[OpenPOWER-HDL-Cores] [Libre-soc-dev] microwatt / libresoc dcache

Luke Kenneth Casson Leighton lkcl at lkcl.net
Fri May 7 02:54:49 UTC 2021



On Fri, May 7, 2021 at 3:31 AM Paul Mackerras <paulus at ozlabs.org> wrote:

> Right, that's something we need to fix throughout microwatt.

it has to be done in one hit. for hints, search for "r1.wb.adr" in here:
https://git.libre-soc.org/?p=soc.git;a=blob;f=src/soc/experiment/dcache.py;hb=HEAD

should be readable and the similarity clear.

> > * AGEN (address generation)
> > * ST data drop
> > * actual fetch.
>
> The 2nd cycle does TLB and cache tag matching.  I'm not sure exactly
> what "ST data drop" is;

i mean "store data is dropped in".  there are code comments saying
"place the store data in one cycle after putting the address in",
or something like that.

> So stores can't be issued until all the operands are available; makes
> sense.

which means we have to make some quite extensive modifications to
dcache.py's FSM: adding a latch for "has AGEN been done" and one for
"has store register been received".  only when both are set can the
dcache store be issued.
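
a minimal sketch of that gating, in plain python rather than nmigen.
this is *not* the dcache.py code: the class name, the field names and
the decision to issue in the same cycle that the second item arrives
are all assumptions made purely for illustration.

```python
class StoreIssueGate:
    """Hypothetical model: a store issues only once both the address
    (AGEN) and the store data have arrived, in either order."""

    def __init__(self):
        self.agen_done = False    # latch: "has AGEN been done"
        self.st_data_ok = False   # latch: "has store register been received"
        self.addr = None
        self.data = None

    def tick(self, addr=None, data=None):
        """Advance one clock.  Returns (addr, data) on the cycle the
        store issues, otherwise None."""
        if addr is not None:
            self.addr, self.agen_done = addr, True
        if data is not None:
            self.data, self.st_data_ok = data, True
        if self.agen_done and self.st_data_ok:
            issued = (self.addr, self.data)
            # clear both latches so that the next store starts afresh
            self.agen_done = self.st_data_ok = False
            self.addr = self.data = None
            return issued
        return None


gate = StoreIssueGate()
print(gate.tick(addr=0x100))   # address only: nothing issues yet
print(gate.tick(data=0xAB))    # data arrives: the store issues
```

the real FSM would also have to cope with stalls and with a second
store arriving before the first has issued, which this sketch ignores.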

complicated and fun :)

> > a normal SRAM you would expect a 1 clock cycle delay, all good.  except
>
> The VHDL construct ram(to_integer(unsigned(rd_addr))) doesn't of
> itself imply a clock edge; it's like a combinatorial RAM not a
> synchronous RAM.  (Imagine a bunch of flip-flops connected to the data
> inputs of a multiplexer whose address input is rd_addr.)  Putting that
> inside a process(clk) begin if rising_edge(clk) then ... construct
> makes that a 1-cycle synchronous RAM.

yes... starting at line 44.
https://github.com/antonblanchard/microwatt/blob/master/cache_ram.vhdl#L44

    process(clk)
    ...
    begin
        if rising_edge(clk) then
            ...
            if rd_en = '1' then
                rd_data0 <= ram(to_integer(unsigned(rd_addr)));
            end if;
        end if;
    end process;

however look at line 70, there's *another* rising_edge block:

    buf: if ADD_BUF generate
    begin
        process(clk)
        begin
            if rising_edge(clk) then
                rd_data <= rd_data0;
            end if;
        end process;

and rd_data0 is declared as a *signal* at line 45, not a variable:

    signal rd_data0 : std_logic_vector(WIDTH - 1 downto 0);

as best i can tell from reading the VHDL, that means that when
ADD_BUF=true there is not a one-clock delay on the rd_data output,
there is a *two* clock delay.

or, i just simply don't know what "<=" in VHDL does when it involves
signals going through other signals.  however given that forward1_data
and forward2_data exhibit the same pattern, and have documented
comments explaining their purpose, i *believe* i am making a correct
inference about VHDL syntax.
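
to make that inference concrete, here is a small cycle-level model in
plain python.  it is only an approximation of the read path of
cache_ram.vhdl: the class and function names are made up, and it
assumes that without ADD_BUF the rd_data output simply follows
rd_data0.  each call to clock() is one rising edge, and (as with VHDL
signals in a clocked process) every right-hand side is sampled before
any register is updated.

```python
class CacheRamModel:
    """Hypothetical model of the read path only."""

    def __init__(self, contents, add_buf):
        self.ram = list(contents)
        self.add_buf = add_buf
        self.rd_data0 = None     # register written by the first process
        self.rd_data_buf = None  # register written by the ADD_BUF process

    @property
    def rd_data(self):
        # assumption: with ADD_BUF=false, rd_data is just rd_data0
        return self.rd_data_buf if self.add_buf else self.rd_data0

    def clock(self, rd_en=False, rd_addr=0):
        # sample using the values from *before* the edge...
        next_rd_data0 = self.ram[rd_addr] if rd_en else self.rd_data0
        next_buf = self.rd_data0
        # ...then update both registers together
        self.rd_data0 = next_rd_data0
        self.rd_data_buf = next_buf


def read_latency(add_buf):
    """Count rising edges from presenting a read until rd_data is valid."""
    m = CacheRamModel([10, 11, 12, 13], add_buf)
    m.clock(rd_en=True, rd_addr=2)   # edge 1: read of address 2 presented
    edges = 1
    while m.rd_data != 12:
        m.clock()                    # further edges with rd_en low
        edges += 1
    return edges


print(read_latency(add_buf=False))   # 1
print(read_latency(add_buf=True))    # 2
```

under those assumptions the model gives one edge without the buffer
and two with it, which is the behaviour i am describing.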

> > here, an *extra* cycle of delay is added.  after assertion of the read it
> > is *two* cycles before the data appears on the read data output.
>
> I think you're attributing a cycle of delay to the ram() construct,
> which it doesn't have.

ah i see where the confusion might be: no, i'm not talking about the
ram() construct, i'm referring to cache_ram.vhdl.

> The dcache definitely does writeback two
> cycles after address generation; I have traces showing that.

i'm not referring to writeback: i'm referring to the actual output -
d_out (the output from dcache.vhdl).

> We do manage to get from the register at the output of the dcache RAMs
> all the way to the data input of the register file RAM in one cycle,
> which is a bit of a stretch, and at higher frequencies would need more
> pipeline stages.

interesting.  a valuable insight i will bear in mind, given that we
intend (down the line) to target ASICs at 2 GHz.

> The way it is now, the data and the way number arrive at the same
> time (at the start of the third cycle) and go into the way select
> multiplexer.  Having the data arrive a cycle earlier wouldn't help all
> that much since we would have to latch it until the way number
> arrives.

these are valuable insights for understanding the code.  i think we
were talking at cross purposes though: you thought i was talking about
the ram() construct, when i was referring instead to cache_ram.vhdl
itself.  hopefully that has now been cleared up, and my follow-up
message with a strategy to reduce the latency of output signals from
dcache.vhdl is clear?

best,

l.

