From 3ba94cf0958f0379d17e9e0eacd5ddb1b2354a3d Mon Sep 17 00:00:00 2001
From: Eugene Cheah
Date: Mon, 18 Sep 2023 17:20:53 -0700
Subject: [PATCH] added disclaimer to architecture cascading pattern

---
 docs/advance/architecture.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/advance/architecture.md b/docs/advance/architecture.md
index 172e2c5..079b9b5 100644
--- a/docs/advance/architecture.md
+++ b/docs/advance/architecture.md
@@ -34,7 +34,9 @@ The follow gif, illustrates the parallel cascading nature over approximately 3 l
 Effectively, allowing the RNN network to run like a transformer network, when rolled out side by side. Where it can be trained "like a transformer" and "executed like an RNN" (the best of both worlds)
 
-All of this is achieved by using a combination of token shifting, channel and time mixing, to comptue the next layer / hidden state.
+All of this is achieved by using a combination of token shifting, channel and time mixing, to replace the LSTM and compute the next layer / hidden state.
+
+> Note: the cascading diagram shows the theoretical optimum. In practice, some trainers and/or inference implementations may batch the cascade into chunks of tokens (32/64/128/256/512) to reduce VRAM lookups and usage, and improve overall performance.
 
 ## What is channel / time mixing, explain it in simple terms?
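
A minimal sketch of the chunking idea described in the added note, not the actual RWKV trainer or inference code: the sequence is split into fixed-size chunks (e.g. 32/64/128/256/512 tokens), each chunk is processed as a unit, and only the carried-over recurrent state crosses chunk boundaries. All names below (`ChunkedRNN`, `chunk_size`, `step`, `forward_chunked`) are illustrative assumptions, and the toy mixing step stands in for RWKV's actual token-shift / time-mix / channel-mix operations.

```python
import numpy as np

class ChunkedRNN:
    """Toy recurrent model processed chunk by chunk (illustrative only)."""

    def __init__(self, dim: int, chunk_size: int = 128, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        self.chunk_size = chunk_size
        # Toy weights standing in for the real time/channel mixing parameters.
        self.w_in = rng.standard_normal((dim, dim)) * 0.1
        self.w_state = rng.standard_normal((dim, dim)) * 0.1

    def step(self, x_t: np.ndarray, state: np.ndarray):
        """One token of the cascade: mix the current input with the carried state."""
        new_state = np.tanh(x_t @ self.w_in + state @ self.w_state)
        return new_state, new_state  # (output, next state)

    def forward_chunked(self, x: np.ndarray) -> np.ndarray:
        """Process a (seq_len, dim) sequence in chunks of `chunk_size` tokens.

        Inside a chunk an optimized implementation could batch the work
        "like a transformer"; here tokens are looped for clarity. Only
        `state` is carried across chunk boundaries, "like an RNN".
        """
        seq_len, _ = x.shape
        state = np.zeros(self.dim)
        outputs = []
        for start in range(0, seq_len, self.chunk_size):
            chunk = x[start:start + self.chunk_size]
            for t in range(chunk.shape[0]):
                out, state = self.step(chunk[t], state)
                outputs.append(out)
        return np.stack(outputs)

# Usage: a 1,000-token sequence in 128-token chunks produces the same result
# as pure token-by-token recurrence, while an implementation only needs to
# keep one chunk's activations resident at a time.
model = ChunkedRNN(dim=16, chunk_size=128)
tokens = np.random.default_rng(1).standard_normal((1000, 16))
y = model.forward_chunked(tokens)
print(y.shape)  # (1000, 16)
```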