1
00:00:08,435 --> 00:00:10,602
- Okay, let's get started.
2
00:00:13,372 --> 00:00:15,936
Alright, so welcome to lecture five.
3
00:00:15,936 --> 00:00:18,693
Today we're going to be getting
to the title of the class,
4
00:00:18,693 --> 00:00:21,193
Convolutional Neural Networks.
5
00:00:22,493 --> 00:00:24,134
Okay, so a couple of
administrative details
6
00:00:24,134 --> 00:00:25,933
before we get started.
7
00:00:25,933 --> 00:00:27,980
Assignment one is due Thursday,
8
00:00:27,980 --> 00:00:30,563
April 20, 11:59 p.m. on Canvas.
9
00:00:31,440 --> 00:00:35,607
We're also going to be releasing
assignment two on Thursday.
10
00:00:38,320 --> 00:00:40,434
Okay, so a quick review of last time.
11
00:00:40,434 --> 00:00:43,679
We talked about neural
networks, and how we had
12
00:00:43,679 --> 00:00:45,755
the running example of
the linear score function
13
00:00:45,755 --> 00:00:48,337
that we talked about through
the first few lectures.
14
00:00:48,337 --> 00:00:50,736
And then we turned this
into a neural network
15
00:00:50,736 --> 00:00:53,808
by stacking these linear
layers on top of each other
16
00:00:53,808 --> 00:00:56,969
with non-linearities in between.
17
00:00:56,969 --> 00:00:58,900
And we also saw that
this could help address
18
00:00:58,900 --> 00:01:01,500
the mode problem where
we are able to learn
19
00:01:01,500 --> 00:01:03,807
intermediate templates
that are looking for,
20
00:01:03,807 --> 00:01:06,618
for example, different
types of cars, right.
21
00:01:06,618 --> 00:01:09,006
A red car versus a yellow car and so on.
22
00:01:09,006 --> 00:01:11,138
And to combine these
together to come up with
23
00:01:11,138 --> 00:01:14,790
the final score function for a class.
24
00:01:14,790 --> 00:01:16,998
Okay, so today we're going to talk about
25
00:01:16,998 --> 00:01:18,438
convolutional neural networks,
26
00:01:18,438 --> 00:01:20,825
which is basically the same sort of idea,
27
00:01:20,825 --> 00:01:23,300
but now we're going to
learn convolutional layers
28
00:01:23,300 --> 00:01:26,134
that reason on top of basically explicitly
29
00:01:26,134 --> 00:01:29,217
trying to maintain spatial structure.
30
00:01:31,817 --> 00:01:33,397
So, let's first talk a little bit about
31
00:01:33,397 --> 00:01:36,070
the history of neural
networks, and then also
32
00:01:36,070 --> 00:01:39,067
how convolutional neural
networks were developed.
33
00:01:39,067 --> 00:01:43,796
So we can go all the way back
to 1957 with Frank Rosenblatt,
34
00:01:43,796 --> 00:01:46,308
who developed the Mark
I Perceptron machine,
35
00:01:46,308 --> 00:01:48,688
which was the first
implementation of an algorithm
36
00:01:48,688 --> 00:01:51,785
called the perceptron, which
had sort of the similar idea
37
00:01:51,785 --> 00:01:55,157
of getting score functions,
right, using some,
38
00:01:55,157 --> 00:01:58,437
you know, W times X plus a bias.
39
00:01:58,437 --> 00:02:02,000
But here the outputs are going
to be either one or a zero.
40
00:02:02,000 --> 00:02:04,295
And then in this case
we have an update rule,
41
00:02:04,295 --> 00:02:06,551
so an update rule for our weights, W,
42
00:02:06,551 --> 00:02:09,491
which also look kind of similar
to the type of update rule
43
00:02:09,491 --> 00:02:12,304
that we're also seeing in
backprop, but in this case
44
00:02:12,304 --> 00:02:15,889
there was no principled
backpropagation technique yet,
45
00:02:15,889 --> 00:02:18,182
we just sort of took the
weights and adjusted them
46
00:02:18,182 --> 00:02:22,349
in the direction towards
the target that we wanted.
47
00:02:23,771 --> 00:02:26,918
So in 1960, we had Widrow and Hoff,
48
00:02:26,918 --> 00:02:29,673
who developed Adaline and
Madaline, which was the first time
49
00:02:29,673 --> 00:02:33,290
that we were able to
start to stack
50
00:02:33,290 --> 00:02:37,457
these linear layers into
multilayer perceptron networks.
51
00:02:38,986 --> 00:02:42,592
And so this is starting to now
look kind of like this idea
52
00:02:42,592 --> 00:02:46,658
of neural network layers, but
we still didn't have backprop
53
00:02:46,658 --> 00:02:50,992
or any sort of principled
way to train this.
54
00:02:50,992 --> 00:02:53,436
And so the first time
backprop was really introduced
55
00:02:53,436 --> 00:02:56,015
was in 1986 with Rumelhart.
56
00:02:56,015 --> 00:02:58,676
And so here we can start
seeing, you know, these kinds of
57
00:02:58,676 --> 00:03:00,858
equations with the chain
rule and the update rules
58
00:03:00,858 --> 00:03:03,906
that we're starting to
get familiar with, right,
59
00:03:03,906 --> 00:03:05,318
and so this is the first time we started
60
00:03:05,318 --> 00:03:06,791
to have a principled way to train
61
00:03:06,791 --> 00:03:09,874
these kinds of network architectures.
62
00:03:11,623 --> 00:03:14,961
And so after that, you know,
it still wasn't able to scale
63
00:03:14,961 --> 00:03:18,076
to very large neural networks,
and so there was sort of
64
00:03:18,076 --> 00:03:20,550
a period in which there wasn't a whole lot
65
00:03:20,550 --> 00:03:24,450
of new things happening
here, or a lot of popular use
66
00:03:24,450 --> 00:03:26,237
of these kinds of networks.
67
00:03:26,237 --> 00:03:28,623
And so this really started
being reinvigorated
68
00:03:28,623 --> 00:03:32,790
around the 2000s, so in
2006, there was this paper
69
00:03:33,641 --> 00:03:37,623
by Geoff Hinton and Ruslan Salakhutdinov,
70
00:03:37,623 --> 00:03:39,612
which basically showed that we could train
71
00:03:39,612 --> 00:03:40,719
a deep neural network,
72
00:03:40,719 --> 00:03:43,212
and show that we could
do this effectively.
73
00:03:43,212 --> 00:03:44,445
But it was still not quite
74
00:03:44,445 --> 00:03:47,428
the sort of modern iteration
of neural networks.
75
00:03:47,428 --> 00:03:50,208
It required really careful initialization
76
00:03:50,208 --> 00:03:52,439
in order to be able to do backprop,
77
00:03:52,439 --> 00:03:54,350
and so what they had
here was they would have
78
00:03:54,350 --> 00:03:57,601
this first pre-training
stage, where you model
79
00:03:57,601 --> 00:03:59,456
each hidden layer through this kind of,
80
00:03:59,456 --> 00:04:01,805
through a restricted Boltzmann machine,
81
00:04:01,805 --> 00:04:04,180
and so you're going to get
some initialized weights
82
00:04:04,180 --> 00:04:07,331
by training each of
these layers iteratively.
83
00:04:07,331 --> 00:04:09,583
And so once you get all
of these hidden layers
84
00:04:09,583 --> 00:04:13,898
you then use that to
initialize your, you know,
85
00:04:13,898 --> 00:04:16,891
your full neural network,
and then from there
86
00:04:16,891 --> 00:04:20,224
you do backprop and fine tuning of that.
87
00:04:23,057 --> 00:04:26,146
And so when we really started
to get the first really strong
88
00:04:26,146 --> 00:04:30,219
results using neural networks,
what sort of really
89
00:04:30,219 --> 00:04:34,219
sparked the whole craze
of starting to use these
90
00:04:35,066 --> 00:04:39,233
kinds of networks really
widely was around 2012,
91
00:04:40,268 --> 00:04:42,717
where we had the first really strong results
92
00:04:42,717 --> 00:04:44,980
for speech recognition,
93
00:04:44,980 --> 00:04:47,921
and so this is work out
of Geoff Hinton's lab
94
00:04:47,921 --> 00:04:50,606
for acoustic modeling
and speech recognition.
95
00:04:50,606 --> 00:04:55,021
And then for image recognition,
2012 was the landmark paper
96
00:04:55,021 --> 00:04:58,604
from Alex Krizhevsky
in Geoff Hinton's lab,
97
00:04:59,638 --> 00:05:01,919
which introduced the first
convolutional neural network
98
00:05:01,919 --> 00:05:04,220
architecture that was able to
99
00:05:04,220 --> 00:05:06,813
get really strong results
on ImageNet classification.
100
00:05:06,813 --> 00:05:10,917
And so it took the ImageNet,
image classification benchmark,
101
00:05:10,917 --> 00:05:13,186
and was able to dramatically reduce
102
00:05:13,186 --> 00:05:15,519
the error on that benchmark.
103
00:05:16,793 --> 00:05:19,958
And so since then, you
know, ConvNets have gotten
104
00:05:19,958 --> 00:05:24,236
really widely used in all
kinds of applications.
105
00:05:24,236 --> 00:05:28,225
So now let's step back and
take a look at what gave rise
106
00:05:28,225 --> 00:05:31,714
to convolutional neural
networks specifically.
107
00:05:31,714 --> 00:05:34,113
And so we can go back to the 1950s,
108
00:05:34,113 --> 00:05:37,689
where Hubel and Wiesel did
a series of experiments
109
00:05:37,689 --> 00:05:41,003
trying to understand how neurons
110
00:05:41,003 --> 00:05:42,538
in the visual cortex worked,
111
00:05:42,538 --> 00:05:45,579
and they studied this
specifically for cats.
112
00:05:45,579 --> 00:05:48,273
And so we talked a little bit
about this in lecture one,
113
00:05:48,273 --> 00:05:51,362
but basically in these
experiments they put electrodes
114
00:05:51,362 --> 00:05:53,526
into the cat brain,
115
00:05:53,526 --> 00:05:56,066
and they gave the cat
different visual stimuli.
116
00:05:56,066 --> 00:05:57,888
Right, and so, things like, you know,
117
00:05:57,888 --> 00:06:01,171
different kinds of edges, oriented edges,
118
00:06:01,171 --> 00:06:03,187
different sorts of
shapes, and they measured
119
00:06:03,187 --> 00:06:06,937
the response of the
neurons to these stimuli.
120
00:06:09,029 --> 00:06:12,765
And so there were a couple
of important conclusions
121
00:06:12,765 --> 00:06:14,993
that they were able to
make, and observations.
122
00:06:14,993 --> 00:06:17,021
And so the first thing
they found was that, you know,
123
00:06:17,021 --> 00:06:19,534
there's sort of this topographical
mapping in the cortex.
124
00:06:19,534 --> 00:06:22,246
So nearby cells in the
cortex also represent
125
00:06:22,246 --> 00:06:24,932
nearby regions in the visual field.
126
00:06:24,932 --> 00:06:27,767
And so you can see for
example, on the right here
127
00:06:27,767 --> 00:06:31,730
where if you take kind
of the spatial mapping
128
00:06:31,730 --> 00:06:34,475
and map this onto a visual cortex
129
00:06:34,475 --> 00:06:37,750
the more peripheral
regions are these blue areas,
130
00:06:37,750 --> 00:06:41,722
you know, farther away from the center.
131
00:06:41,722 --> 00:06:44,122
And so they also discovered
that these neurons
132
00:06:44,122 --> 00:06:46,789
had a hierarchical organization.
133
00:06:47,634 --> 00:06:51,236
And so if you look at different
types of visual stimuli
134
00:06:51,236 --> 00:06:54,828
they were able to find
that at the earliest layers
135
00:06:54,828 --> 00:06:57,837
retinal ganglion cells
were responsive to things
136
00:06:57,837 --> 00:07:01,601
that looked kind of like
circular regions of spots.
137
00:07:01,601 --> 00:07:04,231
And then on top of that
there are simple cells,
138
00:07:04,231 --> 00:07:07,999
and these simple cells are
responsive to oriented edges,
139
00:07:07,999 --> 00:07:11,146
so different orientation
of the light stimulus.
140
00:07:11,146 --> 00:07:13,246
And then going further,
they discovered that these
141
00:07:13,246 --> 00:07:15,448
were then connected to more complex cells,
142
00:07:15,448 --> 00:07:17,721
which were responsive to
both light orientation
143
00:07:17,721 --> 00:07:19,923
as well as movement, and so on.
144
00:07:19,923 --> 00:07:22,145
And you get, you know,
increasing complexity,
145
00:07:22,145 --> 00:07:25,452
for example, hypercomplex
cells are now responsive
146
00:07:25,452 --> 00:07:28,984
to movement with kind
of an endpoint, right,
147
00:07:28,984 --> 00:07:32,092
and so now you're starting
to get the idea of corners
148
00:07:32,092 --> 00:07:34,175
and then blobs and so on.
149
00:07:38,143 --> 00:07:38,976
And so
150
00:07:40,298 --> 00:07:44,247
then in 1980, the neocognitron
was the first example
151
00:07:44,247 --> 00:07:46,715
of a network architecture, a model,
152
00:07:46,715 --> 00:07:50,924
that had this idea of
simple and complex cells
153
00:07:50,924 --> 00:07:52,454
that Hubel and Wiesel had discovered.
154
00:07:52,454 --> 00:07:55,419
And in this case Fukushima put these into
155
00:07:55,419 --> 00:07:59,038
these alternating layers of
simple and complex cells,
156
00:07:59,038 --> 00:08:00,729
where you had these simple cells
157
00:08:00,729 --> 00:08:03,129
that had modifiable parameters,
and then complex cells
158
00:08:03,129 --> 00:08:06,799
on top of these that
performed a sort of pooling
159
00:08:06,799 --> 00:08:08,791
so that it was invariant to, you know,
160
00:08:08,791 --> 00:08:12,958
different minor modifications
from the simple cells.
161
00:08:14,786 --> 00:08:17,159
And so this is work that
was in the 1980s, right,
162
00:08:17,159 --> 00:08:19,242
and so by 1998 Yann LeCun
163
00:08:21,839 --> 00:08:23,445
basically showed the first example
164
00:08:23,445 --> 00:08:27,743
of applying backpropagation
and gradient-based learning
165
00:08:27,743 --> 00:08:29,645
to train convolutional neural networks
166
00:08:29,645 --> 00:08:32,063
that did really well on
document recognition.
167
00:08:32,063 --> 00:08:35,339
And specifically they
were able to do a good job
168
00:08:35,340 --> 00:08:37,610
of recognizing digits of zip codes.
169
00:08:37,610 --> 00:08:41,028
And so these were then used pretty widely
170
00:08:41,028 --> 00:08:45,082
for zip code recognition
in the postal service.
171
00:08:45,082 --> 00:08:48,320
But beyond that it
wasn't able to scale yet
172
00:08:48,320 --> 00:08:51,579
to more challenging and
complex data, right,
173
00:08:51,579 --> 00:08:53,837
digits are still fairly simple
174
00:08:53,837 --> 00:08:56,350
and a limited set to recognize.
175
00:08:56,350 --> 00:09:00,901
And so this is where
Alex Krizhevsky, in 2012,
176
00:09:00,901 --> 00:09:04,893
gave the modern incarnation of
convolutional neural networks
177
00:09:04,893 --> 00:09:08,900
and his network is what we sort of
colloquially call AlexNet.
178
00:09:08,900 --> 00:09:11,543
But this network really
didn't look so much different
179
00:09:11,543 --> 00:09:14,205
than the convolutional neural networks
180
00:09:14,205 --> 00:09:16,472
that Yann LeCun was dealing with.
181
00:09:16,472 --> 00:09:18,363
They were now, you know,
scaled up
182
00:09:18,363 --> 00:09:21,751
to be larger and deeper, and
183
00:09:21,751 --> 00:09:23,753
the most important parts
were that they were now able
184
00:09:23,753 --> 00:09:26,544
to take advantage of
the large amount of data
185
00:09:26,544 --> 00:09:30,711
that's now available, in web
images, in the ImageNet data set.
186
00:09:32,078 --> 00:09:33,757
As well as take advantage
187
00:09:33,757 --> 00:09:37,724
of the parallel computing power in GPUs.
188
00:09:37,724 --> 00:09:41,033
And so we'll talk more about that later.
189
00:09:41,033 --> 00:09:43,127
But fast forwarding
today, so now, you know,
190
00:09:43,127 --> 00:09:45,434
ConvNets are used everywhere.
191
00:09:45,434 --> 00:09:49,999
And so we have the initial
classification results
192
00:09:49,999 --> 00:09:52,294
on ImageNet from Alex Krizhevsky.
193
00:09:52,294 --> 00:09:55,188
This is able to do a really
good job of image retrieval.
194
00:09:55,188 --> 00:09:57,274
You can see that when we're
trying to retrieve a flower
195
00:09:57,274 --> 00:09:59,488
for example, the features that are learned
196
00:09:59,488 --> 00:10:04,134
are really powerful for
doing similarity matching.
197
00:10:04,134 --> 00:10:07,049
We also have ConvNets that
are used for detection.
198
00:10:07,049 --> 00:10:10,557
So we're able to do a really
good job of localizing
199
00:10:10,557 --> 00:10:14,285
where in an image is, for
example, a bus, or a boat,
200
00:10:14,285 --> 00:10:17,705
and so on, and draw precise
bounding boxes around that.
201
00:10:17,705 --> 00:10:21,353
We're able to go even deeper
beyond that to do segmentation,
202
00:10:21,353 --> 00:10:23,145
right, and so these are now richer tasks
203
00:10:23,145 --> 00:10:26,112
where we're not looking
for just the bounding box
204
00:10:26,112 --> 00:10:27,958
but we're actually going
to label every pixel
205
00:10:27,958 --> 00:10:32,125
in the outline of, you know,
trees, and people, and so on.
206
00:10:34,126 --> 00:10:36,868
And these kind of algorithms are used in,
207
00:10:36,868 --> 00:10:38,864
for example, self-driving cars,
208
00:10:38,864 --> 00:10:42,066
and a lot of this is powered
by GPUs as I mentioned earlier,
209
00:10:42,066 --> 00:10:45,114
which are able to do parallel processing
210
00:10:45,114 --> 00:10:48,812
and able to efficiently
train and run these ConvNets.
211
00:10:48,812 --> 00:10:52,406
And so we have modern
powerful GPUs as well as ones
212
00:10:52,406 --> 00:10:55,634
that work in embedded
systems, for example,
213
00:10:55,634 --> 00:10:59,207
that you would use in a self-driving car.
214
00:10:59,207 --> 00:11:01,695
So we can also look at some
of the other applications
215
00:11:01,695 --> 00:11:03,399
that ConvNets are used for.
216
00:11:03,399 --> 00:11:06,227
So, face recognition, right,
we can put an input image