Attn(Q,K,V)=softmax(QK^T/sqrt(d_k))V
P(w_t|w_<t)=softmax(W_o h_t)
L=-sum y log(y_hat)
theta=theta-eta nabla_theta L
LN(x)=(x-mu)/sqrt(sigma^2+eps)
FFN(x)=W2 sigma(W1 x)+b2
p_theta(x)=prod p(x_t|x_<t)
KL(p||q)=sum p log(p/q)
WE ARE
Takyon AI
Emerging the future with agility.