forked from susanli2016/Data-Analysis-with-R
-
Notifications
You must be signed in to change notification settings - Fork 0
/
visa.Rmd
243 lines (197 loc) · 10.5 KB
/
visa.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
---
title: "Explore 2016 H-1B Visa Petitions"
output: html_document
---
Susan Li
April 12, 2017
[The H-1B program](https://www.foreignlaborcert.doleta.gov/h-1b.cfm) allows employers to temporarily employ foreign workers in the U.S on a nonimmigrant basis in specialty occupations. This is the most common visa status applied by international students after they complete higher education in the U.S and work in a full-time position. For those graduates to apply for H-1B visa, their employers must offer a job and petition for H-1B visa with the US immigration department.
The [Office of Foreign Labor Certification (OFLC)](https://www.foreignlaborcert.doleta.gov/performancedata.cfm) generates program data. However, I downloaded the dataset from [Kaggle](https://www.kaggle.com/datasets) directly after it has been mostly cleaned. To make it as relevant as possible, I will be only looking at the data from 2016.
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
I decide to remove all missing values, so about 3% of applications were omitted.
```{r}
visa <- read.csv('h1b_kaggle.csv', stringsAsFactors = F)
visa <- visa[ which(visa$YEAR==2016),]
visa <- visa[complete.cases(visa[,-1]),]
str(visa)
```
### Throughout the analysis, I will attempt to answer the following questions:
* What are the top occupations America needs the most?
* Who are the top employers that submit the most applications?
* Which employers pay the most for what job?
* Which states and cities hire the most H-1B visa workers?
```{r}
library(dplyr)
library(ggthemes)
library(ggplot2)
visa <- mutate_each(visa, funs(toupper))
visa$PREVAILING_WAGE <- as.numeric(visa$PREVAILING_WAGE)
visa$YEAR <- as.numeric(visa$YEAR)
visa$lon <- as.numeric(visa$lon)
visa$lat <- as.numeric(visa$lat)
occu_group <- group_by(visa, SOC_NAME)
visa_by_occu <- dplyr::summarize(occu_group,
count = n(),
mean = mean(PREVAILING_WAGE))
visa_by_occu <- visa_by_occu[with(visa_by_occu, order(-count)), ]
visa_by_occu <- visa_by_occu[1:20, ]
visa_by_occu$count[2] <- 114102
visa_by_occu= visa_by_occu[-6,]
```
### What are the top occupations?
```{r}
ggplot(aes(x = reorder(SOC_NAME, count), y=count), data = visa_by_occu) +
geom_bar(stat = 'identity') + coord_flip() +
xlab('Occupantions') +
ylab('Number of Applications') +
theme(axis.text = element_text(size = 8),
plot.title = element_text(size = 12)) +
ggtitle('Top H-1B Visa Occupations 2016')
```
Technology related professions such as software developer, computer system analyst, programmer are among the most in demand occupations, analyst, accountant, engineer are among the second most in demand occupations.
### Who are the top employers that submit the most application?
```{r}
employer_group <- group_by(visa, EMPLOYER_NAME)
visa_by_employer <- dplyr::summarize(employer_group,
count = n(),
mean = mean(PREVAILING_WAGE))
visa_by_employer <- visa_by_employer[with(visa_by_employer, order(-count)), ]
visa_by_employer <- visa_by_employer[1:20, ]
ggplot(aes(x = reorder(EMPLOYER_NAME, count), y=count), data = visa_by_employer) +
geom_bar(stat = 'identity') + coord_flip() +
xlab('Employers') +
ylab('Number of Applications') +
theme(axis.text = element_text(size = 7),
plot.title = element_text(size = 10)) +
ggtitle('Top Employers for H-1B Visa Application 2016')
```
Infosys Limited leads by a large margin and submitted over 25000 applications last year. As a matter of fact, eight of the top 20 employers are Indian multinational IT companies.
```{r}
employer_job_group <- group_by(visa, JOB_TITLE, EMPLOYER_NAME)
visa_by_employer_job <- dplyr::summarize(employer_job_group,
count = n(),
mean = mean(PREVAILING_WAGE))
visa_by_employer_job <- visa_by_employer_job[with(visa_by_employer_job, order(-count)), ]
visa_by_employer_job <- visa_by_employer_job[1:20, ]
ggplot(aes(x = reorder(JOB_TITLE, count), y=count, fill=EMPLOYER_NAME), data = visa_by_employer_job) +
geom_bar(stat = 'identity', position = position_dodge()) +
coord_flip() +
ylab('Number of Applications') +
xlab('Job Title') +
theme(axis.text = element_text(size = 8),
axis.text.x = element_text(angle = 90, hjust = 1, vjust = .4),
legend.justification=c(1,0), legend.position=c(1,0), legend.title=element_text(size=8),
legend.text=element_text(size=7),
plot.title = element_text(size = 10)) +
ggtitle('Top Job Titles for H-1B Visa Application 2016')
```
Technology leads and technology analysts are in huge demand at Infosys, developers and programmers are liked by Tata, Google is mainly interested in software engineers. Deloitte and Ernst & Young apply visa for consultants and advisors.
### What occupations make the most Money?
When I look at prevailing wage distribution, I found something interesting.
```{r}
summary(visa$PREVAILING_WAGE)
```
Minimum wage is 0 and maximum wage is 329100000. I suspected '0's were missing values, and there are not many of them. But let's have a look which company offered $329100000.
```{r}
visa[which.max(visa$PREVAILING_WAGE),]
```
For a marketing manager? I don't understand. The application was denied anyway.
```{r}
ggplot(aes(x = PREVAILING_WAGE), data = visa) +
scale_x_continuous(limits = c(40000, 250000)) +
geom_histogram(aes(y = ..density..),bins=50) + geom_density(color='blue') +
xlab('Prevailing Wage(USD)') +
ggtitle('Prevailing Wage Distribution 2016')
```
Majority of the prevailing wages were between 50K and 100K USD per annum.
```{r}
ggplot(aes(x = reorder(SOC_NAME, mean), y=mean), data = visa_by_occu) +
geom_bar(stat = 'identity') + coord_flip() +
xlab('Occupantions') +
ylab('Average Prevailing Wage(USD)') +
theme(axis.text = element_text(size = 8),
plot.title = element_text(size = 10)) +
ggtitle('Top Wage H-1B Visa Occupations 2016')
```
physicians and surgeons enjoy the highest average prevailing wages that almost reach $175K per annum last year, computer information systems managers and electrical engineers take the second spot make approximate $163K per annum last year.
### Which Employers pay the most?
```{r}
ggplot(aes(x = reorder(JOB_TITLE, mean), y=mean, fill=EMPLOYER_NAME), data = visa_by_employer_job) +
geom_bar(stat = 'identity', position = position_dodge()) +
coord_flip() +
ylab('Average Prevailing Wage(USD)') +
xlab('Job Title') +
theme(axis.text = element_text(size = 8),
axis.text.x = element_text(angle = 90, hjust = 1, vjust = .4),
legend.justification=c(1,0), legend.position=c(1,0), legend.title=element_text(size=8), legend.text=element_text(size=7),
plot.title = element_text(size = 9)) +
ggtitle('Top Job Titles and Average Wages for H-1B Visa Application 2016')
```
When comes to the job title and employer, consultants hired by Deloitte enjoy the highest average prevailing wage, software engineer hird by Google are paid far more than the same job title hired by the other companeis.
```{r}
top_employer_df <- filter(visa, EMPLOYER_NAME %in% visa_by_employer[['EMPLOYER_NAME']])
ggplot(aes(x = EMPLOYER_NAME, y = PREVAILING_WAGE), data = top_employer_df) +
geom_boxplot() +
coord_flip(ylim=c(0,150000)) +
xlab('Employers') +
ylab('Prevailing Wage(USD)') +
ggtitle('Wage by Top 20 Employers')
by(top_employer_df$PREVAILING_WAGE, top_employer_df$EMPLOYER_NAME, summary)
```
Microsoft and Google offer the highest wages to their H-1B Visa workers. Their median wage exceeded $100K per annum, while the median wage of UST Global and
L&T Technology are less than $60K per annum.
### Which states and cities apply the most H-1B visas?
```{r}
library(stringr)
visa$CITY <- str_replace(visa$WORKSITE, '(.+),.+', '\\1')
visa$STATE <- str_replace(visa$WORKSITE, '.+,(.+)', '\\1')
state_group <- group_by(visa, STATE)
visa_by_state <- dplyr::summarize(state_group,
count = n(),
mean = mean(PREVAILING_WAGE))
visa_by_state <- visa_by_state[with(visa_by_state, order(-count)), ]
visa_by_state <- visa_by_state[1:20, ]
ggplot(aes(x = reorder(STATE, count), y = count), data = visa_by_state) +
geom_bar(stat = 'identity') + coord_flip() +
xlab('State') +
ylab('Number of Applications') +
ggtitle('Top States Apply the Most H-1B Visas')
```
Expectedly, California hires the most workers on H-1B visas, followed by Texas, New York, New Jersey and Illinois.
```{r}
city_group <- group_by(visa, CITY)
visa_by_city <- dplyr::summarize(city_group,
count = n(),
mean = mean(PREVAILING_WAGE))
visa_by_city <- visa_by_city[with(visa_by_city, order(-count)), ]
visa_by_city <- visa_by_city[1:20, ]
ggplot(aes(x = reorder(CITY, count), y = count), data = visa_by_city) +
geom_bar(stat = 'identity') + coord_flip() +
xlab('City') +
ylab('Number of Applications') +
ggtitle('Top Cities Apply the Most H-1B Visas')
```
This time New York City takes the lead by a large margin in the number of H-1B Visa applications. Not only high tech companies in New York hire H-1B visa workers, but also New York’s fashion industry is heavily reliant on immigrants, from top designers to creative staffs to the sewing workers.
### A Map of H-1B Visa Applications
```{r}
library(maps)
library(RColorBrewer)
state_visa <- dplyr::summarize(state_group,
count = n())
```
```{r}
visa_state <- read.csv('visa_state.csv', stringsAsFactors = F)
rownames(visa_state) <- visa_state$X
visa_state$X <- NULL
```
```{r}
states_map <- map_data("state")
ggplot(visa_state, aes(map_id = region)) +
geom_map(aes(fill = count), map = states_map) +
scale_colour_brewer(palette='Greens') +
expand_limits(x = states_map$long, y = states_map$lat) +
xlab('longitude') + ylab('latitude') + ggtitle('State H-1B Visa Applications')
```
### The End
It has been a great experience exploring 2016 H-1B visa petition data. With [Trump's Cracking down on the H-1B Visa program that Silicon Valley loves](https://www.recode.net/2017/4/3/15164358/trump-white-house-foreign-immigration-h1b-tech-hiring-crackdown), I can't wait to learn the data for 2017.