[OpenPOWER-HDL-Cores] [Libre-soc-dev] microwatt / libresoc dcache
Luke Kenneth Casson Leighton
lkcl at lkcl.net
Fri May 7 02:54:49 UTC 2021
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
On Fri, May 7, 2021 at 3:31 AM Paul Mackerras <paulus at ozlabs.org> wrote:
> Right, that's something we need to fix throughout microwatt.
has to be done in one hit, for hints search "r1.wb.adr" in here:
should be readable and the similarity clear.
> > * AGEN (address generation)
> > * ST data drop
> > * actual fetch.
> The 2nd cycle does TLB and cache tag matching. I'm not sure exactly
> what "ST data drop" is;
i mean "store data is dropped in". there are code comments saying
"place the store date in one cycle after putting the address in",
something like that.
> So stores can't be issued until all the operands are available; makes
means we have to do some quite extensive modifications to dcache.py's
FSM, adding a latch for "has AGEN been done", and one for "has store
register been received", and only if the two are true can the dcache
store be issued.
complicated and fun :)
> > a normal SRAM you would expect a 1 clock cycle delay, all good. except
> The VHDL construct ram(to_integer(unsigned(rd_addr))) doesn't of
> itself imply a clock edge; it's like a combinatorial RAM not a
> synchronous RAM. (Imagine a bunch of flip-flops connected to the data
> inputs of a multiplexer whose address input is rd_addr.) Putting that
> inside a process(clk) begin if rising_edge(clk) then ... construct
> makes that a 1-cycle synchronous RAM.
yes... starting at line 44.
if rising_edge(clk) then
if rd_en = '1' then
rd_data0 <= ram(to_integer(unsigned(rd_addr)));
however look at line 70, there's *another* rising_edge block:
buf: if ADD_BUF generate
if rising_edge(clk) then
rd_data <= rd_data0;
and rd_data is declared as a *signal* at line 45, not a variable:
signal rd_data0 : std_logic_vector(WIDTH - 1 downto 0);
as best i can tell, of reading VHDL, that means that when ADD_BUF=true
there is not a one-clock delay on rd_data output, there is a *two*
or, i just simply don't know what "<=" in VHDL does when it involves
signals going through other signals. however given that forward1_data
and forward2_data exhibit the same pattern, and have documented
comments explaining their purpose, i *believe* i am making a correct
inference about VHDL syntax.
> > here, an *extra* cycle of delay is added. after assertion of the read it
> > is *two* cycles before the data appears on the read data output.
> I think you're attributing a cycle of delay to the ram() construct,
> which it doesn't have.
ah i see where the confusion might be: no, i'm not talking about the
ram() construct, i'm referring to cache_ram.vhdl.
> The dcache definitely does writeback two
> cycles after address generation; I have traces showing that.
i'm not referring to writeback: i'm referring to the actual output -
d_out (the output from dcache.vhdl).
> We do manage to get from the register at the output of the dcache RAMs
> all the way to the data input of the register file RAM in one cycle,
> which is a bit of a stretch, and at higher frequencies would need more
> pipeline stages.
interesting. a valuable insight i will bear in mind, given that we
intend (down the line) to target ASICs at 2 ghz.
> The way it is now, the data and the way number arrive at the same
> time (at the start of the third cycle) and go into the way select
> multiplexer. Having the data arrive a cycle earlier wouldn't help all
> that much since we would have to latch it until the way number
these are valuable insights to understanding the code, i think we're
talking at cross purposes though. by noticing that you thought i was
talking about the ram() construct when i was referring instead to
cache_ram.vhdl itself, that has been cleared up, and my follow-up
message with a strategy to reduce the latency of output signals from
dcache.vhdl is clear?
More information about the OpenPOWER-HDL-Cores