Change "done" mechanism to be tolerant to Single Event Upsets. #17
Lynx005F wants to merge 1 commit into pulp-platform:master
Conversation
ef1663b to
08fffa0
Compare
Smephite
left a comment
The proposed solution, asserting the done signal for two cycles, creates an issue with the context counters.
Two cycles of done will increment/decrement the counters twice, potentially leading to an overflow/underflow.
This in turn will lock up the accelerator if it is not reset between calculations.
I suggest adding a positive edge detection before incrementing/decrementing the counters.
I am uncertain about the effect of the edge detection on reliability, as it once again creates an SEE-susceptible signal. An alternative approach could be to double up the counter / decrease it with every done signal only once.
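The edge detection suggested above could look roughly like the sketch below. It is only an illustration and not code from the repository: the module name ctx_counter_sketch, its ports, the single-cycle start_i pulse and the counter width are all assumptions made up for this example. The counter changes once per rising edge of done_i, so a done signal held for two or more cycles is counted only once.

```systemverilog
// Hypothetical sketch: module name, ports and width are illustrative only.
module ctx_counter_sketch #(
  parameter int unsigned Width = 2
) (
  input  logic             clk_i,
  input  logic             rst_ni,
  input  logic             start_i,  // assumed single-cycle pulse: a job was enqueued
  input  logic             done_i,   // may stay high for two or more cycles
  output logic [Width-1:0] count_o   // number of pending contexts
);

  logic done_q;     // done_i delayed by one cycle
  logic done_rise;  // high for exactly one cycle when done_i goes 0 -> 1

  assign done_rise = done_i & ~done_q;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      // Reset high so that a done_i already set at reset is not seen as an edge.
      done_q  <= 1'b1;
      count_o <= '0;
    end else begin
      done_q <= done_i;
      // A simultaneous start and done leaves the count unchanged.
      if (start_i && !done_rise) begin
        count_o <= count_o + 1'b1;
      end else if (done_rise && !start_i) begin
        count_o <= count_o - 1'b1;
      end
    end
  end

endmodule
```

The sketch does not saturate the counter, and done_q is itself a single flip-flop that can be upset, which is the reliability concern raised above.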
I haven't looked at this code for a long time, so I don't remember the exact details, but I think there are two different levels of vulnerability here:
I would propose fixing the first and more common vulnerability and keeping the smaller one for now, e.g. by just having a basic edge detector like this:
logic done_q;
...
regfile_flags.true_done = ctrl_i.done & ~done_q;
...
// FF to make flags_o.done a pulse without being SEU vulnerable or depending on internal state
always_ff @(posedge clk_i or negedge rst_ni)
begin
  if (!rst_ni) begin
    done_q <= 1'b1; // reset high: a done input already set at reset produces no rising edge
  end
  else begin
    done_q <= ctrl_i.done;
  end
end
Unfortunately I am no longer at IIS, so I can't easily simulate whether this works as intended. If you can move this edge detector closer to your counters, that might be a way to reduce SEU vulnerability a bit more as well.
This changes how the true_done flag is calculated: true_done might never be set, and as such the FSM can stall the accelerator. At the same time, true_done cannot simply be the done input, since that might be set on reset.
To solve this, make true_done assert on the rising edge of the done input. (A fault-tolerant accelerator should continuously assert done and then has the guarantee that this will eventually be forwarded.)
The true_done output itself might also experience a single event upset in just the cycle where it is asserted, which would destroy the done signal. To mitigate this, extend the above mechanism to assert the output for at least two cycles.
This does not add any protection in the other direction, e.g. against an SEU causing an abort when the accelerator is in fact doing fine.
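One way the two-cycle assertion described above might be built is sketched here. This is not necessarily what the commit does: the module name done_stretch_sketch, the port names and the two-stage delay chain are assumptions chosen for illustration. It relies on the accelerator holding done_i high, as the description requires.

```systemverilog
// Hypothetical sketch: module and signal names are illustrative only.
module done_stretch_sketch (
  input  logic clk_i,
  input  logic rst_ni,
  input  logic done_i,      // assumed to be held high once the accelerator has finished
  output logic true_done_o  // high for two cycles after a rising edge of done_i
);

  logic done_q;   // done_i delayed by one cycle
  logic done_qq;  // done_i delayed by two cycles

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      // Reset high so that a done_i already set at reset produces no pulse.
      done_q  <= 1'b1;
      done_qq <= 1'b1;
    end else begin
      done_q  <= done_i;
      done_qq <= done_q;
    end
  end

  // Asserted while done_i is high and the delay chain has not yet filled with
  // ones, i.e. for the first two cycles after done_i rises.
  assign true_done_o = done_i & ~(done_q & done_qq);

endmodule
```

In this sketch an upset in done_q or done_qq while done_i is high would produce an extra pulse, which matches the statement that the other direction is not protected.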