Transformer Neural Networks Derived from Scratch

112,251 views

Algorithmic Simplicity

1 day ago

#transformers #chatgpt #SoME3 #deeplearning
Join me on a deep dive to understand the most successful neural network ever invented: the transformer. Transformers, originally invented for natural language translation, are now everywhere. They have quickly taken over the world of machine learning (and the world more generally) and are now used for almost every application, not least ChatGPT.
In this video I take a more constructive approach to explaining the transformer: starting from a simple convolutional neural network, I will step through all of the changes that need to be made, along with the motivations for why these changes need to be made.
*By "from scratch" I mean "from a comprehensive mastery of the intricacies of convolutional neural network training dynamics". Here is a refresher on CNNs: • Why do Convolutional N...
Chapters:
00:00 Intro
01:13 CNNs for text
05:28 Pairwise Convolutions
07:54 Self-Attention
13:39 Optimizations

COMMENTS: 191
@algorithmicsimplicity 8 months ago
Video about Diffusion/Generative models coming next, stay tuned!
@mahmirr 8 months ago
Was coming to comment this, thanks
@arslanjutt4282 6 months ago
Please make the video!
@micmac8171 3 months ago
Please!
@rah-66comanche94 8 months ago
Amazing video! I really appreciate that you explained the Transformer model *from scratch* and didn't just give a simplistic overview of it 👍 I can definitely see that *a lot* of work was put into this video, keep it up!
@korigamik 2 months ago
Would you share the source code for the animations?
@IllIl 8 months ago
Dude, your explanations are truly next level. This really opened my eyes to understanding transformers like never before. Thank you so much for making these videos. Really amazing resource that you have created.
@tdv8686 8 months ago
Thanks for your explanation; this is probably the best video on YouTube about the core of the transformer architecture so far. Other videos are more about the actual implementation but lack the fundamental explanation. I 100% recommend it to everyone in the field.
@benjamindilorenzo 2 months ago
This is the best video on Transformers I have seen on all of YouTube.
@asier6734 8 months ago
I love the algorithmic way of explaining what the mathematics does. Not too deep, not too shallow, just the right level of abstraction and detail. Please, please explain RNNs and LSTMs; I'm unable to find a proper explanation. Thanks!
@RoboticusMusic 8 months ago
Thank you for not using slides filled with math equations. If someone understands the math, they're probably not watching these videos; if they're watching these videos, they're not understanding the math. It's incredible that so many YouTube teachers decide to add math and just point at it for an hour without explaining anything their audience can grasp, and then in the comments you can tell everybody golf-clapped and understood nothing, except for the people who already grasp the topic. Thank you again for thinking of a smart way to teach simple concepts.
@xt3708 8 months ago
Amen. The power of out-of-the-box teachers is infinite.
@anatolyr3589 1 month ago
Yeah! This "functional" approach to the explanation, rather than a "mechanical" one, is truly amazing 👍👍👍👏👏👏
@mvlad7402 1 day ago
Excellent explanation! All kudos to the author!
@Muhammed.Abd. 8 months ago
That is possibly the best explanation of attention I have ever seen!
@corydkiser 8 months ago
This was top notch. Please do one for RetNets and Liquid Neural Nets.
@StratosFair 2 months ago
I am currently doing my PhD in machine learning (well, on its theoretical aspects), and this video is the best explanation of transformers I've seen on YouTube. Congratulations, and thank you for your work.
@tunafllsh 8 months ago
This video is exactly what I needed. Despite knowing what a transformer is made of, I still felt a sense of incompleteness and didn't know the motivation behind it. Your video answered this question perfectly. Now, understanding why it works is another question.
@Magnetic-Milk 5 months ago
Not so long ago I spent hours searching, trying to understand transformers. In this 18-minute video I learned more than I did in 3 hours of research. This is the best computer science video I have ever watched in my entire life.
@jackkim5869 2 months ago
Truly this is the best explanation of transformers I have seen so far. The especially clear logical flow makes difficult concepts easier to understand. I appreciate your hard work!
@xt3708 8 months ago
Absolutely love how you explain the process of discovery: figure out one part, which then causes a new problem, which can then be solved with another method, and so on. The insight into this process was, for me, even more valuable than understanding the architecture itself.
@chrisvinciguerra4128 7 months ago
Whenever I want to dive deeper into the workings of a subject, it seems I only ever find videos that simply define the parts of how something works, as if from a textbook. You not only explained the ideas behind why the inner workings exist the way they do and how they work, but also acknowledged that it was an intentional effort to take an improved approach to learning.
@jcorey333 3 months ago
This is one of the genuinely best and most innovative explanations of transformers/attention I've ever seen! Thank you.
@diegobellani 8 months ago
Wow, just wow. This video makes you really understand the reason behind the architecture, something you don't really get even from reading the original paper.
@terjeoseberg990 8 months ago
I wasn’t aware that they were using a convolutional neural network in the transformer, so I was extremely confused about why the positional vectors were needed. Nobody else in any of the other videos describing transformers pointed this out. Thanks.
@Hexanitrobenzene 8 months ago
"they were using a convolutional neural network in the transformer" No no, Transformers do not have any convolutional layers. The author of the video just chose a CNN as a starting point in the process of "start with a solution that doesn't work well, understand why it doesn't work well, and try to improve it, changing the solution completely along the way". The main architecture in natural language processing before transformers was the RNN, the recurrent neural network. Then in 2014 researchers improved it with the attention mechanism. However, RNNs do not scale well, because they are inherently sequential, and scale is very important for accuracy. So researchers tried to get rid of RNNs, and succeeded in 2017. CNNs were also tried but, to my not-very-deep knowledge, were less successful. Interesting that the author of the video chose a CNN as a starting point.
@terjeoseberg990 8 months ago
@Hexanitrobenzene, I suppose I'll have to watch this video again. I'll look for what you mentioned.
@Hexanitrobenzene 8 months ago
@terjeoseberg990 A little off topic, but... Not long ago I noticed that YouTube deletes comments with links. OK, automatic spam protection. (Still, the fact that it does this silently is very frustrating...) But does it also delete comments where links are separated into words with "dot" between them? I tried to give you the resource I learned this from, but my comment got dropped twice...
@Hexanitrobenzene 8 months ago
...Silly me, I figured I could just give you the title you can search for: "Dive into Deep Learning". It's an open textbook with code included.
@terjeoseberg990 8 months ago
@Hexanitrobenzene, the best thing to do when YouTube deletes comments is to provide a title or something so I can find it. A lot of words are banned too.
@TTTrouble 8 months ago
I've watched so many video explainers on transformers, and this is the first one that really showed the intuition in a unique and educational way. Thank you; I will need to rewatch this a few times, but I can tell it has unlocked another level of understanding of the attention mechanism that has evaded me for quite some time. (Darned KQV vectors...) Thanks for your work!
@ItsRyanStudios 8 months ago
This is AMAZING. I've been working on coding a transformer network from scratch, and although the code is intuitive, the underlying reasoning can be mind-bending. Thank you for this fantastic content.
@TeamDman 1 month ago
I keep coming back to this because it's the best explanation!!
@TropicalCoder 8 months ago
Very nicely done. Your graphics had a calming, almost hypnotic effect.
@CharlieZYG 8 months ago
Wonderful video. Easily the best video I've seen on explaining transformer networks. This "incremental problem-solving" approach to explaining concepts personally helps me understand and retain the information more efficiently.
@user-eu2li6vf3z 7 months ago
Can't wait for more content from your channel. Brilliantly explained.
@igNights77 7 months ago
Explained thoroughly and clearly from basic principles and practical motivations. Basically the perfect explanation video.
@briancase6180 8 months ago
This is a truly great introduction. I've watched other excellent introductions, but yours is superior in a few ways. Congrats and thanks! 🤙
@ryhime3084 8 months ago
This was so helpful. When reading about how other models like ELMo work, it makes sense how researchers came up with those ideas, but the transformer seemed like it popped out of nowhere with random logic. This video really helps in understanding their thought process.
@RalphDratman 8 months ago
This is by far the best explanation of the transformer architecture. Well done, and thank you very much.
@Muuip 7 months ago
Great concise visual presentation! Thank you, much appreciated! 👍👍
@giphe 8 months ago
Wow! I knew about attention mechanisms but this really brought my understanding to a new level. Thank you!!
@TeamDman 7 months ago
I've had to watch this a few times, great explanation!
@halflearned2190 5 months ago
Hey man, I watched your video months ago, and found it excellent. Then I forgot the title, and could not find it again for a long time. It doesn't show up when I search for "transformers deep learning", "transformers neural network", etc. Consider changing the title to include that keyword? This is such a good video, it should have millions of views.
@algorithmicsimplicity 5 months ago
Thanks for the tip.
@adityachoudhary151 3 months ago
Really made me appreciate NNs even more. Thanks for the video!
@AdhyyanSekhsaria 8 months ago
Great explanation. Haven't found this perspective before.
@JunYamog 4 months ago
Your visualization and explanation are very good and helped me understand a lot. I hope you can put out more videos; it must not be easy, otherwise you would have done it already. Keep it up.
@hadadvitor 8 months ago
Fantastic video, congratulations and thank you for making it.
@SahinKupusoglu 8 months ago
This video was all I needed for LLMs/transformers!
@iustinraznic5811 7 months ago
Amazing explanations and video!
@ArtOfTheProblem 8 months ago
Really well done. I haven't seen your channel before, and this is a breath of fresh air. I've been working on my GPT + transformer video for months, and this is the only video online that tries to simplify things through an independent-realization approach. Before I watched this video, my one-sentence summary of why transformers matter was: "They contain layers that have weights which adapt based on context" (vs. using deeper networks with static layers), and this video helped solidify that further; would you agree? I also wanted to boil down the attention heads as "mini networks" (or linear functions) connected to each token which are trained to do this adaptation. One network pulls out what's important in each word given the context around it, and the other network combines these values to decide the importance of those two words in that context; this is how the 'weights adapt'. I still wonder how important the distinction of linear layer vs. just a single layer is; I like how you pulled that into the optimization section. I know how hard this stuff is to make clear, and you did well here.
@maxkho00 7 months ago
My one-sentence summary of why transformers matter would be "they are standard CNNs, except the words are first re-ordered in a way that makes the CNN's job easier before being fed in". Also, a single NN layer IS a linear layer; I'm not sure what you mean by saying you don't know how important the distinction between the two is.
@ArtOfTheProblem 7 months ago
Thanks @maxkho00
@rogerzen8696 4 months ago
Good job! There was a lot of intuition in this explanation.
@ronakbhatt4880 4 months ago
What a simple but perfect explanation! You deserve 100 times more subscribers.
@yonnn7523 7 months ago
Best explainer of transformers I've seen so far, thanks!
@pravinkool 5 months ago
Fantastic! Loved it! Exactly what I needed.
@clray123 8 months ago
Great video, maybe you could cover retentive network (from the RetNet paper) in the same fashion next - as it aims to be a replacement for the quadratic/linear attention in transformer (I'm curious as to how much of the "blurry vector" problem their approach suffers from).
@nara260 4 months ago
Thanks a lot! This visual lecture cleared the dense fog over my cognitive picture of the transformer.
@quocanhad 2 months ago
You deserve my like, bro. Really awesome video.
@TaranovskiAlex 7 months ago
Thank you for the explanation!
@_MrKekovich 8 months ago
FINALLY something gives me a basic understanding. Thank you so much!
@christrifinopoulos8639 4 months ago
The visualisation was amazing.
@shantanuojha3578 12 days ago
Awesome video, bro. I always like an intuitive explanation.
@algorithmicsimplicity 12 days ago
Thanks so much!
@albertmashy8590 8 months ago
This was amazing
@rishikakade6351 10 days ago
Insane that this website is free. Thanks!
@dmlqdk 2 months ago
Thank you for answering my questions!!
@algorithmicsimplicity 2 months ago
Thanks for the tip! I'm always happy to answer questions.
@marcfruchtman9473 8 months ago
Very interesting. Thank you for the video.
@IzUrBoiKK 8 months ago
As both a math enthusiast and a programmer (who obviously also works on AI), I really liked this video. I can confirm that this is one of the best and most genuine explanations of transformers...
@ArtOfTheProblem 8 months ago
the first so far this year
@palyndrom2 8 months ago
Great video
@sairaj6875 8 months ago
Thank you!!
@minhsphuc12 7 months ago
Thank you so much for this video.
@vedantkhade4395 3 months ago
This video is damn impressive, man!
@lakshay510 2 months ago
Halfway through the video, I pressed the subscribe button. Very intuitive and easy to understand. Keep up the good work, man :) One suggestion: change the title of the video and you'll get more traction.
@algorithmicsimplicity 2 months ago
Thanks, any title in particular you'd recommend?
@user-js7ym3pt6e 3 months ago
Amazing, continue like this.
@Einken 8 months ago
Transformers, more than meets the eye.
@christianjohnson961 8 months ago
Can you do a video on tricks like layer normalization, residual connections, byte pair encoding, etc.?
@TheSonBAYBURTLU 7 months ago
Thank you 🙂
@cem_kaya 8 months ago
Thank you so much
@arongil 8 months ago
Great, thank you!
@yash1152 8 months ago
2:36 Wow, just 50k words... that sounds pretty easy for computers. Amazing.
@anilaxsus6376 8 months ago
Best explanation I have seen so far. Basically, the transformer is a CNN with a lot of extra upgrades. Good to know.
@laithalshiekh3792 8 months ago
Your video is amazing
@Supreme_Lobster 8 months ago
Thanks. I had read the original Transformer paper and barely understood the underlying ideas.
@Tigerfour4 8 months ago
Great video, but it left me with a question. I tried to compare what you arrived at (16:25) to the original transformer equations, and if I understand correctly, in the original we don't add the red W2X matrix; we have a residual connection instead, so it is as if we added X without passing it through an additional linear layer. Am I correct in this observation, and do you have an explanation for this difference?
@algorithmicsimplicity 8 months ago
Yes that's correct, the transformer just adds x without passing it through an additional linear layer. Including the additional linear layer doesn't actually change the model at all, because when the result of self attention is run through the MLP in the next layer, the first thing the MLP does is apply a linear transform to the input. Composition of 2 linear transforms is a linear transform, so we may as well save computation and just let the MLP's linear transform handle it.
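[Editor's note: the claim in this reply, that composing two linear transforms gives a single linear transform, can be checked numerically. A minimal NumPy sketch; the matrix names and sizes are invented for illustration:]

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # an input vector
W2 = rng.normal(size=(4, 4))  # the "extra" linear layer after self-attention
M = rng.normal(size=(4, 4))   # the first linear transform inside the next MLP

# Applying W2 and then M is identical to applying the single matrix M @ W2,
# so the extra layer adds no expressive power and can be absorbed into the MLP.
assert np.allclose(M @ (W2 @ x), (M @ W2) @ x)
```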
@cloudysh 5 months ago
This is perfect
@introstatic 8 months ago
This is brilliant. Could you give a hint about where to look for details on the idea of the pairwise convolution layer? I can't find anything with this exact wording.
@algorithmicsimplicity 8 months ago
Yeah it's a term I made up so you won't find it in any sources, sorry about that. Usually sources will just talk about self attention in terms of key, query and value lookups, so you can look at those to get a more detailed understanding of the transformer. The value transform is equivalent to the linear representation function I use in the pairwise convolution, the key and query attention scores are equivalent to the bi-linear form scoring function I use (with the bi-linear form weight matrix given by Q^TK). I chose to use this unusual terminology because, personally, I feel the key, query and value terminology comes out of nowhere, and I wanted to connect the transformer more directly to its predecessor (the CNN).
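[Editor's note: the equivalence this reply states, between the key/query dot-product score and a bi-linear form with matrix Q^T K, can be verified directly. A hedged NumPy sketch; the dimension and matrix names are made up:]

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
x1, x2 = rng.normal(size=d), rng.normal(size=d)  # two word vectors
Q = rng.normal(size=(d, d))  # "query" transform
K = rng.normal(size=(d, d))  # "key" transform

# Key-query dot-product score, as in standard self-attention...
score_kq = (Q @ x1) @ (K @ x2)
# ...equals a bi-linear form whose weight matrix is Q^T K.
score_bilinear = x1 @ (Q.T @ K) @ x2
assert abs(score_kq - score_bilinear) < 1e-8
```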
@introstatic 8 months ago
@algorithmicsimplicity, this is a surprising connection. Thanks a lot for the explanation.
@dsagman 8 months ago
@algorithmicsimplicity It would be great if you could make this connection between terminologies in video form. Maybe next time?
@iandanforth 8 months ago
I wish this had tied in specifically to the nomenclature of the transformer, such as where these operations appear in a block, whether they are part of both the encoder and decoder paths, how they relate to "KQV", and whether there's any difference between these basic operations and "cross attention".
@ArtOfTheProblem 8 months ago
I'll be doing this, but in short: the little networks he showed connected to each pair are KQ (the word-pair representation), and V is the value network. All of this can be done in a decoder-only model as well, and cross-attention is the same thing except two separate sequences look at each other (such as two sentences in a translation network). It's nice to know that GPT, for example, is decoder-only, and so doesn't even need this.
@AerialWaviator 8 months ago
Very fascinating topic, with an excellent dive and insights into how neural networks derive results. One thing I was left wondering: why is there no scoring vector describing the probability that a word is a noun, verb, or adjective? Encoding a word's context (regardless of language) should provide a great deal of information, eliminating many convolutional pairings and reducing computational effort. Thanks for a new-found appreciation of transformers.
@ArtOfTheProblem 8 months ago
This is a good question, and it's also a GOFAI-type approach, where we make the mistake of thinking we can inject some human semantic idea to improve a network. The reality is that it will do this automatically, without our help. For example, papers back in 1986 show tiny networks automatically grouping words into nouns or verbs; it's amazing. Let me know if you want more details.
@komalsinghgurjar 6 months ago
Sir, I like your videos very much. Love from India ♥️♥️.
@AdhyyanSekhsaria 8 months ago
Thanks!
@algorithmicsimplicity 8 months ago
Thank you so much for the kind words.
@nightchicken3517 7 months ago
I really love SoME
@frederik7054 8 months ago
The video is of great quality! Which tool did you create this with? Manim?
@algorithmicsimplicity 8 months ago
Yep all my videos so far have been done in Manim.
@GaryBernstein 8 months ago
Can you explain how the NN produces the important-word-pair information scores, described after 12:15, for the sentence problem raised at 10:17? Well, it's just another trained set of values. I suppose it scores pairs' importance over the pairs' uses in ~billions of sentences.
@algorithmicsimplicity 8 months ago
The importance-scoring neural network is trained in exactly the same way that the representation neural network is. Roughly speaking, for every weight in the importance-scoring neural network you increase the value of that weight slightly and then re-evaluate the entire transformer on a training example. If the new output is closer to the training label, then that was a good change so the weight stays at its new value. If the new output is further away, then you reverse the change to that weight. Repeat this over and over again on billions of training examples and the importance-scoring neural network weights will end up set to values so that that the produced scores are useful.
@korigamik 7 months ago
This is great, can you share the source code for the video?
@AurL_69 3 months ago
Holy pepperoni, you're great!
@user-km3kq8gz5g 4 months ago
You are amazing
@cezarydziemian6734 7 months ago
Wow, great video, but I have a problem understanding one thing. I'm trying to understand it by watching all 3 videos, and what I have trouble understanding is how these pairs of words (vectors) from the first layer are matched together into new vectors. For example, for the "cat sat" pair, we have two vectors: [0001] and [0100]. How are they transformed into the vector [1.3, -0.9...]? If this is just the result of some internal neural net, where did the data (weights) for this net come from? Or, if they started from random numbers, how were they trained?
@algorithmicsimplicity 7 months ago
The pair vectors are first concatenated together into one vector e.g. [00010100], and this vector is then run through the neural network which produces the output vector. The output is the result of the weights in the neural network. Initially, those weights are completely random (usually sampled from a normal distribution centred at 0), and then they are updated during training. The neural network is trained on a labelled training dataset of input and output pairs. For example, ChatGPT was trained to do next word prediction on billions of passages of text scraped from the internet. In this case, each training example is a random part of a text passage (e.g. "the cat sat on the") and the output is the next word that occurs in the text (e.g. "mat"). For every training example an update step is performed on the neural network to update all of the weights of all of the layers. The update step works as follows: 1) Evaluate the neural network on the input. 2) For every weight in every layer, increase the value of that weight by a small amount (e.g. 0.001) and then re-evaluate the entire neural network on the input. If the new output is closer to the target (e.g. the vector output is closer to the one-hot encoding of "mat") then it was good to change that weights value, so it keeps the new value. If the new output is further away from the target, then it was a bad change, so reverse it. And that's it. Just keep repeating that update step for billions of different inputs and all of the weights in all layers will eventually be set to values which allow the transformer as a whole to map inputs to outputs correctly. Also I should point out that in practice there is a faster way to do the update step which is called backprop. Backprop computes exactly the same result as the update process I described, it is just faster computationally (you only need to evaluate the model twice instead of once for every weight), but it is also more difficult to understand.
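[Editor's note: the naive update step described in this reply can be sketched as code. This is the perturbation procedure as described, not how real frameworks train (they use backprop, as the reply notes); the tiny one-layer "network" and data are invented for illustration:]

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(2, 4))    # weights of a tiny one-layer "network"
x = rng.normal(size=4)         # one training input (e.g. a concatenated pair)
target = np.array([1.0, 0.0])  # its training label

def loss(W):
    # squared distance between the network's output and the label
    return np.sum((W @ x - target) ** 2)

eps = 0.001
initial = loss(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        current = loss(W)
        W[i, j] += eps            # nudge one weight...
        if loss(W) >= current:    # ...did the output get closer to the label?
            W[i, j] -= eps        # no: reverse the change
assert loss(W) <= initial + 1e-9  # one pass of this rule never makes things worse
```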
@izainonline 8 months ago
Please share more informative videos.
@domasvaitmonas8814 1 month ago
Thanks. Amazing video. One question though - how do you train the network to output the "importance score"? I get the other part of the self-attention mechanism, but the score seems a bit out of the blue.
@algorithmicsimplicity 1 month ago
The entire model is trained end-to-end to solve the training task. What this means is you have some training dataset consisting of a bunch of input/label pairs. For each input, you run the model on that input, then you change the parameters in the model a bit, evaluate it again and check if the new output is closer to the training label, if it is you keep the changes. You do this process for every parameter in all layers and in all value and score networks, at the same time. By doing this process, the importance score generating networks will change over time so that they produce scores which cause the model's outputs to be closer to the training dataset labels. For standard training tasks, such as predicting the next word in a piece of text, it turns out that the best way for the score generating networks to influence the model's output is by generating 'correct' scores which roughly correspond to how related 2 words are, so this is what they end up learning to do.
@Baigle1 6 months ago
I think they were actually used as far back as 2006, or even earlier, in public compressor-algorithm competitions.
@user-pw5do6tu7i 8 months ago
I understood about 40% of that. How does the transformer deal with "the lion saw that the fire was roaring"? What protects the pair values against attributing "roaring" to the lion in this case? Are the other words just going to dominate in value and suppress the lion-roaring pair value?
@ewthmatth 8 months ago
This video is being recommended under videos about power-grid transformers. I guess it's due to the... algorithmic simplicity... of YouTube itself. (I'll see myself out.)
@kdjshfihekls 7 months ago
I had done a whole project on transformers before watching this video, and I still felt like I didn't understand them completely in detail. After watching this video, I really feel like I understand transformers. I just have one question: does the neural network that calculates the attention scores take all the word pairs in the column as input, or just one word pair?
@algorithmicsimplicity 7 months ago
The neural network takes as input a single pair (concatenation of both vector in the pair) and produces a single scalar output score for that pair. This same network is applied to every pair to produce all n^2 scores. The scores are then normalized across columns (exponentiate each score, then divide by sum of column). In practice, transformers don't use a neural network to produce scores but a simple bi-linear function, which is typically referred to as the "key-query dot product self attention".
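[Editor's note: the per-pair scoring and column-wise normalization described in this reply can be sketched in NumPy. The tiny two-layer scoring MLP and all sizes here are invented for illustration:]

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 4
X = rng.normal(size=(n, d))       # n word vectors
W1 = rng.normal(size=(2 * d, 8))  # illustrative 2-layer scoring MLP
W2 = rng.normal(size=(8, 1))

def score(xi, xj):
    # concatenate the pair and run it through the tiny MLP -> one scalar
    h = np.maximum(np.concatenate([xi, xj]) @ W1, 0.0)  # ReLU hidden layer
    return (h @ W2).item()

# all n^2 pair scores: entry [i, j] scores the pair (word i, word j)
S = np.array([[score(X[i], X[j]) for j in range(n)] for i in range(n)])

# normalize each column: exponentiate, then divide by the column sum (softmax)
A = np.exp(S - S.max(axis=0, keepdims=True))
A = A / A.sum(axis=0, keepdims=True)
assert np.allclose(A.sum(axis=0), 1.0)
```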
@alanraftel5033 8 months ago
9:07 Will the result be similar if we sum the vectors down each row instead of each column?
@algorithmicsimplicity 8 months ago
Yep, everything is completely symmetric for rows/columns, since in either case you are still summing over all pairs that contain a particular word.
@dmlqdk 4 months ago
How does the explanation in this video relate to Query, Key and Values (as defined in the Attention is all you need paper)? This is really a great video - thank you!!
@algorithmicsimplicity 4 months ago
The "key-query" attention scoring is equivalent to the bi-linear scoring function in my explanation, where the bi-linear form matrix is given by K^TQ. The value transformation V is exactly the linear representation function in my explanation. I still have no idea why they decided to give the scoring function matrix two different names (key and query), it just confuses everyone.
@dmlqdk 2 months ago
Let's assume X is our input, a sentence containing N words. Each word has an embedding dimension of size P, so X is an NxP matrix. Then, according to the "Attention is All You Need" paper, we have:

K = X * W_k
Q = X * W_q
V = X * W_v
A = softmax(QK^T)
Output = AV

where W_k, W_q and W_v are PxP matrices; K, Q and V are NxP matrices; A is an NxN matrix; and Output is an NxP matrix.

I am confused about how the V matrix connects to the "pair-wise representations". In the video, you show operations being done on pairs of words (such as at 13:40). However, there don't seem to be any pair-wise operations occurring when computing V. If there were a pair-wise operation, wouldn't the dimension of W_v need to be NxN instead of PxP?

I agree that we are computing a single "attention" scalar value for each word pair, which is why A has dimension NxN. However, it seems like V contains individual representations of the words that are "smooshed" together when we multiply by A, rather than V containing (or operating on) pair-wise representations.

Again, great video! And I greatly appreciate your help! @algorithmicsimplicity
@algorithmicsimplicity 2 months ago
@dmlqdk When you apply the linear transform V to the pair [x1, x2], the result is V1x1 + V2x2. Basically we are applying a linear transform to each input and summing them. Because, in a given column, x2 is the same for every pair, you are effectively just adding a constant value to each V1x_i. You can factor this constant value outside of the attention weights, at which point it just becomes part of the residual term. I explained this process in more detail here: www.reddit.com/r/MachineLearning/comments/17cmzcz/comment/k5t7g70/?context=3 At this point, you no longer have 'pair' representations, since each value vector is now just a linear transform applied to one word. Each column of the [NxN] grid of value vectors contains V1x_i for i in {1,...,n}, i.e. all of the columns are identical. Since all of the columns are identical, instead of elementwise multiplying the matrix of attention values by the matrix of value vectors and then summing, you can rewrite this operation as a single matrix-vector product, which is what the AV operation is in standard self-attention. V is that column of value vectors, where each entry is just V1x_i.
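[Editor's note: the factoring described in this reply, where the column-constant V2 term pops out of the attention average as a residual, can be verified numerically. A NumPy sketch; shapes and names are invented:]

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5, 4
X = rng.normal(size=(n, d))     # word vectors as rows
V1 = rng.normal(size=(d, d))    # linear transform of the first pair element
V2 = rng.normal(size=(d, d))    # linear transform of the second pair element
A = rng.random(size=(n, n))
A = A / A.sum(axis=0, keepdims=True)  # attention weights, columns sum to 1

# Explicit pairwise computation: weight each pair vector V1 x_i + V2 x_j
# by a_ij and sum down each column j.
explicit = np.stack([
    sum(A[i, j] * (V1 @ X[i] + V2 @ X[j]) for i in range(n))
    for j in range(n)
])

# Factored form: because each column of A sums to 1, the V2 term comes out
# as a per-word "residual", and only the V1 term needs the attention average.
factored = A.T @ (X @ V1.T) + X @ V2.T
assert np.allclose(explicit, factored)
```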
@dmlqdk
@dmlqdk 2 months ago
This makes so much more sense now. Thank you!!@@algorithmicsimplicity
@pi5549
@pi5549 8 months ago
I really hope you push forwards with this approach. I can't find anywhere a clear and complete from-the-ground-up exposition of Transformers. Sorry to say I can't find it here either. You start with image ConvNets. I think you might break this down into (1) construct a representation that captures long-range information, and (2) a classifier, and observe that once we have the representation we could use it for tasks other than classification. What's jarring here is that I've only seen conv-nets in the context of classification, and classification-of-a-sentence is almost meaningless, unless we just want a sentiment-analyzer or something trivial. I'd like to see a section that explains "First we get the representation, then we can USE that to construct a next-word-predictor". If the initial problem/scenario isn't well framed, all the internals feel fuzzy, as they're not representing steps towards a clear goal. I really hope you consider running at this again.
@ArtOfTheProblem
@ArtOfTheProblem 8 months ago
I'm working on a video now which attempts to do this; I've been on it for months. One key thing I notice where you get fuzzy is thinking CNNs 'only do classification', and that this is different from next-word prediction. Next-word prediction is a type of classification (the output class is the next letter or word), so you could have a plain old fully connected network do "next word prediction" if you trained it that way. Please let me know what else you are thinking, as it might help me with my script.

In my video I will open with RNNs applied to next-word prediction (which starts in 1986), then explain where they break and why we need parallel approaches, why simple brute force doesn't work (too many parameters, and hard to train), and why transformers helped (they compressed many layers into fewer adaptive layers).
@pi5549
@pi5549 7 months ago
@@ArtOfTheProblem uff, I'm not sure my comment made any sense at all. I'll reply more on your latest micro-vid, which considers an Attention block as a dynamic-routing layer.
@clehaxze
@clehaxze 8 months ago
For reference, there is a good CNN-like architecture for text called RWKV. It claims to be an RNN, but the authors admit in interviews that they effectively roll the RNN out in time, so it acts like a CNN when running, for efficiency.
@algorithmicsimplicity
@algorithmicsimplicity 8 months ago
Yeah RWKV is similar to state-space models and long-convolutional models, the key thing that makes these methods work is regularizing the convolutional filter to have larger weights for nearby elements and smaller weights for far away elements (for RWKV this is achieved by multiplying vectors by an exponentially decaying value). I am hoping to make a video about this class of models eventually.
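A toy sketch of that equivalence (my own illustration, not RWKV itself): a causal convolution whose filter decays exponentially with distance computes exactly the same thing as a simple RNN recurrence rolled out in time.

```python
import numpy as np

decay, w = 0.9, 1.0
T = 6
x = np.arange(1.0, T + 1)            # toy input sequence

# Convolution view: filter weight on an element t steps in the past
# is w * decay**t, so nearby elements get larger weights.
filt = w * decay ** np.arange(T)
conv = np.array([
    sum(filt[t - s] * x[s] for s in range(t + 1))   # causal sum over the past
    for t in range(T)
])

# RNN view: h[t] = decay * h[t-1] + w * x[t], rolled out step by step.
rnn = np.zeros(T)
acc = 0.0
for t in range(T):
    acc = decay * acc + w * x[t]
    rnn[t] = acc
```

Unrolling the recurrence gives h[t] = sum over s of decay^(t-s) * w * x[s], which is exactly the convolution above, so the two views produce identical outputs.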
@nikims_
@nikims_ 8 months ago
@@algorithmicsimplicity Please do! i've been looking forward to such a video about RWKV for a long time
@d-star491
@d-star491 7 months ago
Isn't the red term being added to each vector already a kind of residual connection?
@algorithmicsimplicity
@algorithmicsimplicity 7 months ago
Yep! so you don't even need to include it at all if you already have residual connections.
@komalsinghgurjar
@komalsinghgurjar 6 months ago
Sir, please tell me which software you use, or how you make these amazing animations depicting numerical systems. I'm interested in making these types of videos, with presentations, for physics tutorials. Thanks.
@algorithmicsimplicity
@algorithmicsimplicity 6 months ago
My videos were made using the Python package Manim (www.manim.community/)
@masilivesifanele5384
@masilivesifanele5384 8 months ago
Can you implement this in C or Vanilla C#?
@edh615
@edh615 8 months ago
You need to update this for RetNets