---
title: "Lecture 1"
author: "DJM"
date: "25 September 2018"
output:
  slidy_presentation:
    css: http://mypage.iu.edu/~dajmcdon/teaching/djmRslidy.css
    font_adjustment: 0
  pdf_document: default
---
# Syllabus
## Course info
__Instructor__:
Daniel McDonald
Office: Harper Center 359
Website: http://mypage.iu.edu/~dajmcdon
Email: daniel.mcdonald@chicagobooth.edu
__Office hours__:
Me: Tuesday 11-12:30, by appointment
__Course webpage__:
WWW: https://booth41911.github.io/
Github: https://github.com/booth41911
Piazza: https://piazza.com/chicagobooth/fall2018/bus41911/home
Piazza sign-up: https://piazza.com/chicagobooth/fall2018/bus41911
__Lectures__:
Tuesday 2-5pm, Harper Center Seminar Room 3SW
__Textbook__:
Required: ESL = [_Elements of Statistical Learning_](https://web.stanford.edu/~hastie/ElemStatLearn/)
__Prerequisite__:
Ph.D.-level course(s) in mathematical statistics and econometrics at the level of Business 41901, 41902, 41903 or Economics 31000-31200.
## Course objectives
Topics:
1. Data analysis workflow;
2. Supervised learning;
3. Model selection;
4. Dimension reduction;
5. Graphical Models/Time series/Causal inference if time allows.
Applications from finance/macro/marketing as best I can.
Emphasis on developing a workflow to avoid mistakes.
## Lectures
Class time will consist of a
combination of lecture, discussion, questions and answers,
and problem
solving,
with a focus on problem solving.
You are strongly encouraged
to bring a laptop to class.
## Textbook
The required textbook for this class is the standard for ML at this level.
It is excellent and free electronically.
There are some extra suggested materials (especially for this week). They are linked on the website.
If you find additional resources, let me know! I'll link them.
## Course communication
We'll use Piazza. (Seems common to Booth?)
Try to constrain communication to Piazza.
## Grading
* First class pre-test, 5% (completion)
* Participation on in-class exercises and Piazza, 15%
* Group/individual homework (3-4 exercises), 20%
* Take-home midterm exam, 25%
* Course project, 10% (PC1), 25% (PC2)
## In-class exercises
We will regularly have short programming or theoretical exercises during class time.
## Homeworks
* All assignments will be submitted on Github.
* Homework grades
will be awarded based on complete and accurate analyses according to the rubric provided with each assignment.
* Note that the submission repo is __public__ (to the class)
## Project
* You will be required to complete a project (1-3 people).
* You may choose this group.
* The goal is to analyze a dataset in depth using anything that you have learned.
* See the [project](https://booth41911.github.io/project.html) page for detailed requirements.
* This is a PhD course.
The entire class will give brief presentations during the last week of classes.
## Exams
* One take-home exam to be completed entirely individually.
* The midterm will be made available by __Thursday, October 25__
* Submission by __Thursday, November 1__ at 11:59pm.
> If either of these dates poses some type of problem, you must notify me no later than Friday, September 28. After this date, no alternatives will be considered without medical documentation.
## Solutions
* Some of the problems that are assigned are similar or identical to those assigned in previous years
by me or other instructors for this or other courses.
* Using proofs or
code from anywhere other than the textbooks (with attribution), this
year's course notes (with attribution), or the course website is not
only considered cheating
(as described below), it is easily detectable cheating.
## Cheating
(Things I hope I don't have to say.)
* In previous years, I have caught students cheating on exams.
* Cheating, in my experience, occurs because students don't understand the material, so the result is usually a failing grade even before I impose any penalty and report the incident to the Dean's office.
* I carefully structure exams to make it so that I can catch these issues.
* I ~~will~~ catch you, and it does not help.
* If you are struggling, I want to help.
> If I suspect cheating, your case will be forwarded to the Dean's office. No questions asked.
# Git + Github
## Acknowledgements
* Much of this lecture is borrowed/stolen from Colin Rundel and Karl Broman
## Why version control?
<p align=center>
![This is why](gfx/versioncontrol.png)
</p>
## Why version control?
* Simple formal system for tracking all changes to a project
* Time machine for your projects
    + Track blame and/or praise
    + Remove the fear of breaking things
* Learning curve is steep, but when you need it you REALLY need it
> Your closest collaborator is you six months ago, but you don’t reply to emails.
> -- _Paul Wilson_
## Why Git
* You could use something like Box or Dropbox
* These are poor-man's version control
* Git is much more appropriate
* It works with large groups
* It's very fast
* It's much better at fixing mistakes
* Tech companies use it (so it's in your interest to have some experience)
> This will hurt, but what doesn't kill you, makes you stronger.
## Why part 2
* As a student, I didn't know anything about git/github.
* No one taught me how to use it.
* As a Professor who collaborates a lot, I find it indispensable.
* I had to learn it on my own, and I still have trouble.
* If I make you do it, you'll be better off than I was.
__JMLR vs. dajmcdon/dpf__
## Set up
* Open Rstudio
* Go to the "Tools" menu and select "Shell"
* Type
```
git --version
```
* If it's there, you're done.
* For detailed install instructions, or to be more choosy in the version you get, see [Happy git with R](http://happygitwithr.com/install-git.html)
## (Hard way, Windows)
* Install [Git for Windows](https://git-for-windows.github.io/), also known as `msysgit` or "Git Bash",
* **NOTE:** Select "Use Git from the Windows Command Prompt" during installation. Otherwise, we believe it's OK to accept the defaults.
* Note that RStudio for Windows prefers for Git to be installed below `C:/Program Files`, often at `C:/Program Files/Git/bin/git.exe`.
## (Hard way, Mac)
Install the Xcode command line tools (**not all of Xcode**), which includes Git. If your OS is older than 10.11 El Capitan, it is possible that you **must** install the Xcode command line tools in order for RStudio to find and use Git.
Go to the shell and enter one of these commands to elicit an offer to install developer command line tools:
``` bash
git --version
git config
```
Accept the offer! Click on "Install".
## (Hard way, Linux)
Install Git via your distro's package manager.
Ubuntu or Debian Linux:
```sh
sudo apt-get install git
```
Fedora or RedHat Linux:
```sh
sudo yum install git
```
## Pause for actions
<p align=center>
![graph](gfx/git-graph.png)
</p>
## Initial configuration
* Now tell it who you are:
```
$ git config --global user.name "Daniel McDonald"
$ git config --global user.email "dajmcdon@gmail.com"
$ git config --global core.editor nano
```
* `nano` is a light-weight text editor. You probably won't ever use it, but it's good to tell
git to use nano. If for some reason git asks you questions about doing things, just remember
> Ctrl + X is quit
(it says this at the bottom of the window)
* This is all that we will do with git in the terminal
* Rstudio has git built in, and so we'll use it through there mostly
* Using the terminal gives access to more complicated operations, which hopefully we won't have to deal with.
## Github
* This is a commercial site that interacts with Git (GitLab and Bitbucket do as well)
* You need a free account http://www.github.com/ (get the Education one)
* Git tracks changes to your files in both places
* Your files live in __repositories__ (repos in the lingo)
* You and others share access to repos
## When you want to work on a file
* You __pull__ the repo onto your local machine
* Make all your changes, saving as many times as you like.
* You __commit__ your changes with a message, e.g.: "I fixed the horrible bug"
* You __push__ your changes back to the repo
## Syncing up with Rstudio
Demonstration:
1. Preferences > Git/svn
2. If there is a key there, and you know when you made it/how to use it, skip to 4.
3. If empty, "Create"
4. "View public key", copy.
5. Github.com > Profile photo > Settings > SSH and GPG keys > New SSH key
6. Paste
Now you don't need your passphrase (from this computer)
## In class practice
* Go to https://github.com/booth41911/class-roster/
* Clone or download.
* Rstudio > New Project > Version Control > Git
* Add your information to the table.
* Save.
* Stage (click the box). Commit (add message).
* __Pull__
* Fix issues.
* Push
# R/Rmarkdown
## Mark Up, Markdown
* You are probably used to word processing programs, like Microsoft Word, which
employ the "what you see is what you get" (WYSIWYG) principle:
    - you want italics, click, and, lo, they're in italics
    - you want some to be in a bigger, different font and you just select
      the font, and so on.
* This works well enough for n00bs but is not a viable
basis for a system of text formatting
* because it depends on a particular
program (a) knowing what you mean and (b) implementing it well.
* The essential idea in a __mark-up language__ is that it consists of ordinary text, _plus_ signs which indicate
how to change the formatting or meaning of the text.
* Some mark-up languages,
like HTML (Hyper-Text Markup Language) use very obtrusive markup; others,
like the language called __Markdown__, are more subtle.
## Example
For instance, the
last few sentences in Markdown look like this:
```
* The essential idea in a
__mark-up language__ is that it consists of ordinary text, _plus_ signs which
indicate how to change the formatting or meaning of the text.
* Some mark-up
languages, like HTML (Hyper-Text Markup Language) use very obtrusive markup;
others, like the language called __Markdown__, are more subtle.
```
## Rendering
* Every mark-up language needs to be __rendered__ somehow into a format which
actually includes the fancy formatting, images, mathematics, computer code,
etc., etc., specified by the mark-up.
* For HTML, the rendering program is
called a "web browser".
* Most computers which know how to work with Markdown
at all know how to render it as
    - HTML (which you can then view in a browser),
    - PDF (uses LaTeX, which you can then view in Acrobat or the like),
    - Word (which you can
      then view in the abomination of Redmond).
## Advantages
* portability across machines
* no proprietary software (I'm looking at you Bill)
* stability
R Markdown is, in
particular, both "free as in beer" (you will never pay a dollar for software to
use it) and "free as in speech" (the specification is completely open to all to
inspect).
Even if you are completely OK with making obeisance to the
Abomination of Redmond every time you want to read your own words, the sheer
stability of mark-up languages makes them superior for scientific documents.
## Rendering and Editing
To write R Markdown, you will need a text editor, a program which lets you read
and write plain text files. You will also need R, and the package `rmarkdown`
(and all the packages it depends on).
* Most computers come with a text editor (TextEdit on the Mac, Notepad on Windows machines, etc.).
* There are also lots of higher-powered text editors; I use [Emacs](http://www.gnu.org/software/emacs/emacs.html), but I admit it has a harsh learning curve.
* You _could_ use Word (or some other WYSIWYG word processor), taking care to always save your document in plain text format. I do not recommend this.
* [R Studio](http://www.rstudio.com) comes with a built-in text editor, which knows about, and has lots of tools for, working with R Markdown documents.
If this is your first time using a text editor for something serious, I recommend using R Studio. (That's what I'm using to write this.)
## Rendering in R Studio
Assuming you have the document you're working on open in the text editor,
click the button that says "knit".
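The "knit" button is a front end for `rmarkdown::render()`, so the same rendering can be started from the R console. Here is a minimal sketch, assuming the `rmarkdown` package and pandoc are installed. The temporary file and its contents are placeholders for illustration, not part of the course materials.

```r
library(rmarkdown)

# Write a tiny placeholder document to a temporary file
rmd <- tempfile(fileext = ".Rmd")
writeLines(c("---",
             "title: Demo",
             "---",
             "",
             "Some *ordinary* text."), rmd)

# Render it to HTML, which is what the "knit" button does for you.
# The guard skips rendering on machines without pandoc.
if (pandoc_available()) {
  out <- render(rmd, output_format = "html_document", quiet = TRUE)
  print(out)  # path of the rendered .html file
}
```

For a real document you would pass your own file name, e.g. `render("myfile.Rmd")` (a hypothetical name).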
## Basic Formatting in R Markdown
For the most part, text is just text. One advantage of R Markdown is that the
vast majority of your document will be stuff you just type as you ordinarily
would.
__Rmarkdown = Vanilla Markdown + R code.__
# Selected tips
## Title, Author, Date, Output Format
You can specify things like title, author and date in the **header** of your
R Markdown file. This goes at the very beginning of the file, preceded
and followed by lines containing three dashes. Thus the beginning of
this file looks like so:
```
---
title: Using R Markdown for Class Reports
author: DJM
date: "`r format(Sys.time(), '%d %B, %Y')`"
---
```
You can also use the header to tell R Markdown whether you want it to render to
HTML (the default), PDF, or something else. To have this turned into PDF,
for instance, I'd write
```
---
title: Using R Markdown for Class Reports
author: DJM
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: pdf_document
---
```
* This header information is called `YAML`, in case you need to [Google](http://www.google.com) it.
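The `date` field in the headers above is filled in by R each time the document is rendered. Here is a small sketch of what that `format()` call produces, using a fixed date (an assumption for illustration) in place of `Sys.time()`. The month name follows your locale.

```r
# Fixed date used only for illustration; the header uses Sys.time() instead
d <- as.Date("2018-09-25")

# %d = day of month, %B = full month name, %Y = four-digit year
stamp <- format(d, "%d %B, %Y")
stamp
```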
## Including Code
### Code Chunks and Their Results
A code **chunk** is simply an off-set piece of code by itself. It is preceded
by ` ```{r} ` on a line by itself, and ended by a line which just says ` ``` `.
The code itself goes in between. Here, for instance, is some code which
loads a data set from a library, and makes a scatter plot.
```{r}
library(MASS)
data(cats)
plot(Hwt ~ Bwt, data=cats, xlab="Body weight (kg)", ylab="Heart weight (g)")
```
### Inline Code
* This is code not set off on a line by itself, but beginning with
` `r ` and ending with ` ` `. Using inline code is how this document
knows that the `cats` data set contains `r nrow(cats)` rows, and that the
median weight of the female cats' hearts was `r median(cats$Hwt[cats$Sex=="F"])`
grams.
* Notice that inline code does _not_ display the commands run, just their output.
## Seen But Not Heard
* Code chunks (but not inline code) can take a lot of __options__ which modify
how they are run, and how they appear in the document.
* These options go after
the initial `r` and before the closing `}` that announces the start of a code
chunk. One of the most common options turns off printing out the code,
but leaves the results alone:
` ```{r, echo=FALSE} `
* Another runs the code, but includes neither the text of the code nor its output.
` ```{r, include=FALSE} `
* This might seem pointless, but it can be useful for code chunks which do set-up
like loading data files, or initial model estimates, etc.
* Another option prints the code in the document, but does not run it:
` ```{r, eval=FALSE} `
This is useful if you want to talk about the (nicely formatted) code.
* You can give chunks names immediately after their opening, like
` ```{r clevername} `. This name is then used for the images (or other files)
that are generated when the document is rendered.
## Tables
The default print-out of matrices, tables, etc. from R Markdown is frankly
ugly. The `knitr` package contains a very basic command, `kable`, which will
format an array or data frame more nicely for display.
Compare:
```{r}
coefficients(summary(lm(Hwt ~ Bwt, data=cats)))
```
with
```{r}
library(knitr) # Only need this the first time!
kable(coefficients(summary(lm(Hwt ~ Bwt, data=cats))))
```
--- Of course, R's defaults print out a crazy number of decimal places. So use
```{r}
kable(coefficients(summary(lm(Hwt ~ Bwt, data=cats))),digits=3)
```
## "Caching" Code Chunks (Re-Running Only When Changed)
* By default, R Markdown will re-run all of your code every time you render your
document.
* If some of your code is slow, this can add up to a lot of time. You
can, however, ask R Markdown to keep track of whether a chunk of code has
changed, and only re-run it if it has.
* This is called __caching__ the chunk.
```{r, cache=FALSE}
lm(Hwt ~ Bwt, data=cats)
```
## Setting Defaults for All Chunks
You can tell R to set some defaults to apply to all chunks where you don't
specifically over-ride them. Here are the ones I generally use:
```{r setup, eval=FALSE}
# Need the knitr package to set chunk options
library(knitr)
# Set knitr options for knitting code into the report:
# - Don't print out code (echo)
# - Save results so that code blocks aren't re-run unless code changes (cache),
# _or_ a relevant earlier code block changed (autodep), but don't re-run if the
# only thing that changed was the comments (cache.comments)
# - Don't clutter R output with messages or warnings (message, warning)
# This _will_ leave error messages showing up in the knitted report
opts_chunk$set(echo=FALSE, fig.align='center',
               cache=TRUE, autodep=TRUE, cache.comments=FALSE,
               message=FALSE, warning=FALSE)
```
* This sets some additional options beyond the ones I've discussed,
like not re-running a chunk if only the comments have changed
(`cache.comments = FALSE`), and leaving out messages and warnings.
* You can over-ride these defaults by setting options for individual chunks.
## Math in R Markdown
This is a bit complicated because the behavior (unfortunately) depends on the output type.
Rendering PDF: just use LaTeX as usual (but use `\[ \]` rather than `$$`).
Rendering HTML: essentially LaTeX, but with some idiosyncrasies.
Rendering Word: $\ldots$.
__Point out the Cheat sheet.__
## More details
Lots of details in the __rmarkdownDetails__ file in the lectures repo.
# My workflow
## The basics
1. I do all class documents in Rmarkdown. Notes, slides, etc.
2. When working out new code for a project, I use a combination of an R package (as I get close to completion) and Rmarkdown.
    - Many stat journals require reproducible, documented code. An R package is great for this.
    - Rmarkdown alone lets me document as I go.
3. My students are required to give me reports in Rmarkdown.
## For professional presentations
I use Rmarkdown + Beamer:
```
---
title: "Statistical implications of (some) computational approximations"
author: |
  | Daniel J. McDonald
  | Indiana University, Bloomington
  | \alo{\footnotesize \url{mypage.iu.edu/~dajmcdon}}
date: |
  | 24 January 2018
output:
  beamer_presentation:
    keep_tex: yes
    fig_caption: no
    includes:
      in_header: ApproxStatMacros.tex
classoption: "aspectratio=43"
---
```
The `includes` statement loads all the LaTeX macros I need and specifies Beamer choices.
## The old presentation
1. Write lots of LaTeX, R code in separate files
2. Need a figure. Run R code, get figure, save as pdf.
3. Recompile LaTeX. Axes are unreadable. Back to R, rerun R code,...
4. Recompile LaTeX. Can't distinguish lines. Back to R, rerun R code,...
5. Etc, etc.
Now:
Code and LaTeX live in one file. I just recompile.
## The old manuscript
Similar to the old presentation.
Now:
1. R package with documented code, available on Github.
2. One script to run the analysis, one to gather the results.
3. One Rmd file to take in the results, do preprocessing, generate all figures.
4. LaTeX file in the journal's style.
## The optimal
Same as above but with a Makefile to automatically run 1-4.
# Rmarkdown practice
## Recall Logistic regression
Suppose $Y=1$ with probability $p(x)$ and $Y=0$ with probability $1-p(x)$.
I want to model $P(Y=1| X=x)$.
I'll assume that $\log\left(p(x)/(1-p(x))\right) = ax$ for some scalar $a$.
We're going to estimate $a$ given data.
## Recall gradient descent
Want to solve
\[
\min_y f(y).
\]
You can calculate $f'(y)$.
Procedure:
1. Input a guess `y(0)`, a parameter `gam`.
2. Until I say stop: `y(t) <-- y(t-1) - gam * f'(y(t-1)); t++`.
3. Stop.
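The procedure above can be sketched in R as follows. This is a minimal illustration rather than the only way to do it. The stopping rule (`tol`, `maxit`) and the toy objective $f(y) = (y-3)^2$ are assumptions chosen for the example.

```r
# Gradient descent for a function of one variable.
#   fprime: function returning the derivative f'(y)
#   y0:     initial guess
#   gam:    step size
#   tol, maxit: hypothetical stopping rule standing in for "until I say stop"
grad_descent <- function(fprime, y0, gam, tol = 1e-8, maxit = 10000) {
  y <- y0
  for (t in seq_len(maxit)) {
    step <- gam * fprime(y)    # gam * f'(y(t-1))
    y <- y - step              # y(t) <- y(t-1) - gam * f'(y(t-1))
    if (abs(step) < tol) break # stop once the updates become negligible
  }
  y
}

# Toy objective (illustration only): f(y) = (y - 3)^2, so f'(y) = 2 * (y - 3)
y_hat <- grad_descent(function(y) 2 * (y - 3), y0 = 0, gam = 0.1)
y_hat  # should be very close to the minimizer, 3
```

If `gam` is too large the iterates can overshoot and diverge. If it is too small, convergence is slow. Try a few values to see both behaviours.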
## Exercise
1. Rstudio > New Project...
2. Clone Github repo "week1"
3. Rename "[week1.Rmd]" to "[favorite_new_name.Rmd]"
4. Use the .Rmd document to implement GD for 1-parameter logistic regression.
5. Push your results back to the repo
Notes:
\[
\begin{aligned}
p(z) &= \frac{1}{1 + \exp(-z)}\\
f(a,y,x) &= -\sum_i \left[ y_i \log(p(ax_i)) + (1-y_i) \log(1-p(ax_i)) \right]\\
f'(a,y,x) &= -\sum_i (y_i-p(ax_i))x_i
\end{aligned}
\]
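
Before running gradient descent, it can help to check the derivative formula numerically. The sketch below takes $f$ to be the negative log-likelihood (the quantity to minimize), which is the sign convention that matches the $f'$ given in the notes. The simulated data, the seed, and the values `a_true`, `a0` and `h` are all assumptions for illustration.

```r
set.seed(1)                      # simulated data, not course data
n <- 200
x <- rnorm(n)
a_true <- 2                      # hypothetical "true" parameter

p <- function(z) 1 / (1 + exp(-z))
y <- rbinom(n, size = 1, prob = p(a_true * x))

# Negative log-likelihood in a, and its derivative from the notes
f <- function(a, y, x) {
  -sum(y * log(p(a * x)) + (1 - y) * log(1 - p(a * x)))
}
fprime <- function(a, y, x) -sum((y - p(a * x)) * x)

# Central finite difference at an arbitrary point a0 as a sanity check
a0 <- 0.5
h <- 1e-5
numeric_deriv <- (f(a0 + h, y, x) - f(a0 - h, y, x)) / (2 * h)
c(analytic = fprime(a0, y, x), numeric = numeric_deriv)
```

The two numbers should agree to several decimal places. If they do not, the objective and the derivative are using different sign conventions or one of them has a bug.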