\documentclass[12pt, numbers=noenddot]{scrreprt} %use komascript instead of just article
\setkomafont{disposition}{\normalcolor\bfseries} %use the standard article look
\usepackage[margin=1in]{geometry}
\usepackage{graphicx}
\usepackage{cite}
\usepackage{amsthm, amsmath, amssymb}
\usepackage{setspace}\onehalfspacing
%\usepackage{setspace}\doublespacing
\usepackage[loose,nice]{units} %replace "nice" by "ugly" for units in upright fractions
\usepackage{parskip} % don't need to do a double backslash for a linebreak
\usepackage[linktocpage=true]{hyperref}
\usepackage{tabularx} % in the preamble
\usepackage{listings}
\usepackage{color}
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}
\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour},
% commentstyle=\color{codegreen},
% keywordstyle=\color{magenta},
% numberstyle=\tiny\color{codegray},
% stringstyle=\color{codepurple},
% basicstyle=\footnotesize,
% breakatwhitespace=false,
% breaklines=true,
% captionpos=b,
% keepspaces=true,
% numbers=left,
% numbersep=5pt,
% showspaces=false,
% showstringspaces=false,
% showtabs=false,
% tabsize=2
}
\lstset{style=mystyle}
\usepackage{hyperref}
\hypersetup{
colorlinks=true,
linkcolor=blue,
filecolor=magenta,
urlcolor=cyan,
}
\urlstyle{same}
\title{The Data Engineering Cookbook}
\subtitle{Mastering The Plumbing Of Data Science}
\author{Andreas Kretz}
\date{\today\\v2.1}
\begin{document}
\maketitle
%\begin{abstract}
%Purpose of this document and its conclusion
%\end{abstract}
\pagebreak
\setcounter{tocdepth}{3}
\tableofcontents
\pagebreak
\part{Introduction}
\chapter{How To Use This Cookbook}
What do you actually need to learn to become an awesome data engineer?
Look no further, you'll find it here.
If you are looking for AI algorithms and such data scientist things, this book is not for you.
\textbf{How to use this document:} \\
First of all, this is not a training! This cookbook is a collection of skills that I value highly in my daily work as a data engineer. It's intended to be a starting point for you to find the topics to look into and become an awesome data engineer.
You are going to find \textbf{\textit{Five Types of Content}} in this book: articles I wrote, links to my podcast episodes (video \& audio), more than 200 links to helpful websites I like, data engineering interview questions and case studies.
\textbf{This book is a work in progress!} \\
As you can see, this book is not finished. I'm constantly adding new stuff and doing videos for the topics. But obviously, because I do this as a hobby, my time is limited. You can help make this book even better.
\textbf{Help make this book awesome!}\\
If you have some cool links or topics for the cookbook, please become a contributor on GitHub: \url{https://github.com/andkret/Cookbook}. Pull the repo, add them and create a pull request. Or join the discussion by opening Issues.
You can also write me an email any time to plumbersofdatascience@gmail.com. Tell me your thoughts, what you value, what you think should be included, or correct me where I am wrong.
\textbf{This Cookbook is and will always be free! }\\
I don't want to sell you this book, but please support what you like and join my Patreon: \url{https://www.patreon.com/plumbersofds}
Check out this podcast episode where I talk in detail why I decided to share all this information for free:
\href{https://youtu.be/k1bS5aSPos8}{\#079 Trying to stay true to myself and making the cookbook public on GitHub}
\pagebreak
\chapter{Data Engineer vs Data Scientists}
\begin{table}[h]
\begin{tabular}{ll}
\hline
\multicolumn{2}{l}{\textbf{Podcast Episode:} \#050 Data Engineer Scientist or Analyst Which One Is For You?} \\ \hline
\multicolumn{2}{p{15cm}}{In this podcast we talk about the differences between data scientists, analysts and engineers. These are the three main data science jobs and all three are super important. This episode makes it easier to decide which one is for you} \\ \hline
\multicolumn{1}{l|}{YouTube} & \href{https://youtu.be/64TYZETOEdQ}{Click here to watch} \\
\multicolumn{1}{l|}{Audio} & \href{https://anchor.fm/andreaskayy/episodes/050-Data-Engineer-Scientist-or-Analyst-Which-One-Is-For-You-e45ibl}{Click here to listen} \\ \hline
\end{tabular}
\captionof{table}{Podcast: 050 Data Engineer Scientist or Analyst Which One Is For You?} %\label{tbl:spotifycasestudy}
\end{table}
\section{Data Scientist}
Data scientists aren’t like every other scientist.
Data scientists do not wear white coats or work in high tech labs full of science fiction movie equipment. They work in offices just like you and me.
What sets them apart from most of us is that they are the math experts. They use linear algebra and multivariable calculus to create new insight from existing data.
How exactly does this insight look?
Here’s an example:
An industrial company produces a lot of products that need to be tested before shipping.
Usually such tests take a lot of time because there are hundreds of things to be tested. All to make sure that your product is not broken.
Wouldn’t it be great to know early if a test fails ten steps down the line? If you knew that you could skip the other tests and just trash the product or repair it.
That’s exactly where a data scientist can help you, big-time. This field is called predictive analytics and the technique of choice is machine learning.
Machine what? Learning?
Yes, machine learning, it works like this:
You feed an algorithm with measurement data.
It generates a model and optimises it based on the data you fed it with. That model basically represents a pattern of what your data looks like.
You show that model new data and the model will tell you if the data still represents the data you have trained it with.
This technique can also be used for predicting machine failure in advance. Of course the whole process is not that simple.
The actual process of training and applying a model is not that hard. A lot of work for the data scientist is to figure out how to pre-process the data that gets fed to the algorithms.
Because to train an algorithm you need useful data. If you use just any data for the training, the produced model will be very unreliable.
An unreliable model for predicting machine failure would tell you that your machine is damaged even if it is not. Or even worse: it would tell you the machine is ok even when there is a malfunction.
Model outputs are very abstract. You also need to post-process the model outputs to receive health values from 0 to 100.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\textwidth]{images/Machine-Learning-Pipeline}
\caption{The Machine Learning Pipeline}
\label{fig:Bild1}
\end{figure}
\section{Data Engineer}
Data Engineers are the link between the management’s big data strategy and the data scientists that need to work with data.
What they do is build the platforms that enable data scientists to do their magic.
These platforms are usually used in four different ways:
\begin{itemize}
\item Data ingestion and storage of large amounts of data
\item Algorithm creation by data scientists
\item Automation of the data scientist’s machine learning models and algorithms for production use
\item Data visualisation for employees and customers
\end{itemize}
Most of the time these guys start as traditional solution architects for systems that involve SQL databases, web servers, SAP installations and other “standard” systems.
But to create big data platforms the engineer needs to be an expert in specifying, setting up and maintaining big data technologies like: Hadoop, Spark, HBase, Cassandra, MongoDB, Kafka, Redis and more.
What they also need is experience in how to deploy systems on cloud infrastructure like Amazon or Google, or on on-premises hardware.
\begin{table}[h]
\begin{tabular}{ll}
\hline
\multicolumn{2}{l}{\textbf{Podcast Episode:} \#048 From Wannabe Data Scientist To Engineer My Journey} \\ \hline
\multicolumn{2}{p{15cm}}{In this episode Kate Strachnyi interviews me for her humans of data science podcast. We talk about how I found out that I am more into the engineering part of data science.} \\ \hline
\multicolumn{1}{l|}{YouTube} & \href{https://youtu.be/pIZkTuN5AMM}{Click here to watch} \\
\multicolumn{1}{l|}{Audio} & \href{https://anchor.fm/andreaskayy/episodes/048-From-Wannabe-Data-Scientist-To-Engineer-My-Journey-e45i2o}{Click here to listen} \\ \hline
\end{tabular}
\captionof{table}{Podcast: 048 From Wannabe Data Scientist To Engineer My Journey} %\label{tbl:spotifycasestudy}
\end{table}
\section{Who Companies Need}
For a good company it is absolutely important to get well trained data engineers and data scientists. Think of the data scientist as the professional race car driver. A fit athlete with talent and driving skills like you have never seen.
What he needs to win races is someone who will provide him with the perfect race car to drive. That’s what the data engineer is for.
Like the driver and his team, the data scientist and the data engineer need to work closely together. They need to know the different big data tools inside and out.
That's why companies are looking for people with Spark experience. It is a common ground between both that drives innovation.
Spark gives data scientists the tools to do analytics and helps engineers to bring the data scientist’s algorithms into production. After all, those two decide how good the data platform is, how good the analytics insight is and how fast the whole system gets into a production ready state.
\part{Basic Data Engineering Skills}
\chapter{Learn To Code}
Why this is important: Without coding you cannot do much in data engineering. I cannot count the number of times I needed a quick Java hack.
The possibilities are endless:
\begin{itemize}
\item Writing data into or quickly getting some data out of a SQL DB
\item Testing to produce messages to a Kafka topic
\item Understanding the source code of a Java web service
\item Reading counter statistics out of an HBase key-value store
\end{itemize}
So, which language do I recommend then?
I highly recommend Java. It’s everywhere!
When you are getting into data processing with Spark you should use Scala. But, after learning Java this is easy to do.
Also Python is a great choice. It is super versatile.
Personally however, I am not that big into Python. But I am going to look into it.
Where to Learn?
There’s a Java course on Udemy you could look at: \url{https://www.udemy.com/java-programming-tutorial-for-beginners}
Topics you should look into:
\begin{itemize}
\item OOP Object oriented programming
\item What are Unit tests to make sure what you code is working
\item Functional Programming
\item How to use build management tools like Maven
\item Resilient testing (?)
\end{itemize}
I talked about the importance of learning by doing in this podcast:
\url{https://anchor.fm/andreaskayy/episodes/Learning-By-Doing-Is-The-Best-Thing-Ever---PoDS-035-e25g44}
\chapter{Get Familiar With Git}
Why this is important: One of the major problems with coding is to keep track of changes. It is also almost impossible to maintain a program you have multiple versions of.
Another one is the topic of collaboration and documentation, which is super important.
Let’s say you work on a Spark application and your colleagues need to make changes while you are on holiday. Without some code management they are in huge trouble:
Where is the code? What have you changed last? Where is the documentation? How do we mark what we have changed?
But if you put your code on GitHub your colleagues can find your code. They can understand it through your documentation (please also write in-line comments).
Developers can pull your code, make a new branch and do the changes. After your holiday you can inspect what they have done and merge it with your original code, and you end up having only one application.
Where to learn:
Check out the GitHub Guides page where you can learn all the basics: \url{https://guides.github.com/introduction/flow/}
This great Git commands cheat sheet saved my butt multiple times: \url{https://www.atlassian.com/git/tutorials/atlassian-git-cheatsheet}
Also look into:
\begin{itemize}
\item Pull
\item Push
\item Branching
\item Forking
\end{itemize}
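To make this workflow more concrete, here is a minimal sketch of the clone, branch and merge routine described above, using plain Git commands (the branch name is just an example):
\begin{lstlisting}
# get a copy of the repository
git clone https://github.com/andkret/Cookbook.git
cd Cookbook

# create a new branch for your changes
git checkout -b my-changes

# edit files, then record and publish the changes
git add .
git commit -m "describe what you changed"
git push origin my-changes

# later: get the latest changes your colleagues made
git pull origin master
\end{lstlisting}
On GitHub you would then open a pull request from that branch, which is exactly the review-and-merge step from the holiday example above.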
\chapter{Agile Development -- available}
Agility, the ability to adapt quickly to changing circumstances.
These days everyone wants to be agile. Big or small company, people are looking for the “startup mentality”.
Many think it’s the corporate culture. Others think it’s the process of how we create things that matters.
In this article I am going to talk about agility and self-reliance. About how you can incorporate agility in your professional career.
\section{Why is agile so important?}
Historically, development is practiced as a rigidly defined process. You think of something, specify it, have it developed and then built in mass production.
It’s a bit of an arrogant process. You assume that you already know exactly what a customer wants. Or how a product has to look and how everything works out.
The problem is that the world does not work this way!
Often times the circumstances change because of internal factors.
Sometimes things just do not work out as planned or stuff is harder than you think.
You need to adapt.
Other times you find out that you build something customers do not like and need to be changed.
You need to adapt.
That’s why people jump on the Scrum train. Because Scrum is the definition of agile development, right?
\section{Agile rules I learned over the years -- available}
\subsection{Is the method making a difference?}
Yes, Scrum or Google’s OKR can help to be more agile. The secret to being agile however, is not only how you create.
What makes me cringe is when people try to tell you that being agile starts in your head. So, the problem is you?
No!
The biggest lesson I have learned over the past years is this: Agility goes down the drain when you outsource work.
\subsection{The problem with outsourcing}
I know, on paper outsourcing seems like a no-brainer: you trade development costs against fixed costs.
It is expensive to bind existing resources on a task. It is even more expensive if you need to hire new employees.
The problem with outsourcing is that you pay someone to build stuff for you.
It does not matter who you pay to do something for you. He needs to make money.
His agenda will be to spend as little time as possible on your work. That is why outsourcing requires contracts, detailed specifications, timetables and delivery dates.
He doesn’t want to spend additional time on a project only because you want changes in the middle. Every unplanned change costs him time and therefore money.
If you want changes, you need to make another detailed specification and a contract change.
He is not going to put his mind into improving the product while developing. Firstly because he does not have the big picture. Secondly because he does not want to.
He is doing as he is told.
Who can blame him? If I was the subcontractor I would do exactly the same!
Does this sound agile to you?
\subsection{Knowledge is king: A lesson from Elon Musk}
Doing everything in house is why startups are so productive. No time is wasted on waiting for someone else.
If something does not work, or needs to be changed, there is someone in the team who can do it right away.
One very prominent example who follows this strategy is Elon Musk.
Tesla’s Gigafactories are designed to get raw materials in on one side and spit out cars on the other. Why do you think Tesla is building Gigafactories that cost a lot of money?
Why is SpaceX building its own rocket engines? Clearly there are other, older companies who could do that for them.
Why is Elon building tunnel boring machines at his new Boring Company?
At first glance this makes no sense!
\subsection{How you really can be agile}
If you look closer it all comes down to control and knowledge. You, your team, your company, needs to do as much as possible on your own. Self-reliance is king.
Build up your knowledge and therefore the team's knowledge. When you have the ability to do everything yourself, you are in full control.
You can build electric cars, rocket engines or bore tunnels.
Don’t rely too much on others and be confident enough to just do stuff on your own.
Dream big and JUST DO IT!
PS. Don’t get me wrong. You can still outsource work. Just do it in a smart way by outsourcing small independent parts.
\section{Agile Frameworks}
\subsection{Scrum}
There's an interesting Scrum publication on Medium with a lot of details about Scrum: \url{https://medium.com/serious-scrum}
Also, the Scrum guide webpage has good info about Scrum: \url{https://www.scrumguides.org/scrum-guide.html}
\subsection{OKR}
I personally love OKR, been doing it for years. Especially for smaller teams OKR is great. You don't have a lot of overhead and get work done.
It helps you stay focused and look at the bigger picture.
I recommend doing a sync meeting every Monday. There you talk about what happened last week and what you are going to work on this week.
I talked about this in this Podcast: \url{https://anchor.fm/andreaskayy/embed/episodes/Agile-Development-Is-Important-But-Please-Dont-Do-Scrum--PoDS-041-e2e2j4}
There is also this awesome 1.5-hour startup guide from Google: \url{https://youtu.be/mJB83EZtAjc} I really love this video; I rewatched it multiple times.
\section{Software Engineering Culture}
The software engineering and development culture is super important. How does a company handle product development with hundreds of developers? Check out this podcast:
\begin{table}[h]
\begin{tabular}{ll}
\hline
\multicolumn{2}{l}{\textbf{Podcast Episode:} \#070 Engineering Culture At Spotify} \\ \hline
\multicolumn{2}{p{15cm}}{In this podcast we look at the engineering culture at Spotify, my favorite music streaming service.
The process behind the development of Spotify is really awesome.} \\ \hline
\multicolumn{1}{l|}{YouTube} & \href{https://youtu.be/1asVrsUDbp0}{Click here to watch} \\
\multicolumn{1}{l|}{Audio} & \href{https://anchor.fm/andreaskayy/episodes/070-The-Engineering-Culture-At-Spotify-e45ipa}{Click here to listen} \\ \hline
\end{tabular}
\captionof{table}{Podcast: 070 Engineering Culture At Spotify} %\label{tbl:spotifycasestudy}
\end{table}
\textbf{Some interesting slides:}
\url{https://labs.spotify.com/2014/03/27/spotify-engineering-culture-part-1/}
\url{https://labs.spotify.com/2014/09/20/spotify-engineering-culture-part-2/}
\chapter{Learn how a Computer Works}
\section{CPU, RAM, GPU, HDD}
\section{Differences between PCs and Servers}
I talked about computer hardware and GPU processing in this podcast: \url{https://anchor.fm/andreaskayy/embed/episodes/Why-the-hardware-and-the-GPU-is-super-important--PoDS-030-e23rig}
\chapter{Computer Networking - Data Transmission}
\section{OSI Model}
The OSI Model describes how data flows through the network. It consists of seven layers, starting with the physical layer, which is basically how the data is transmitted over the wire or optical fiber.
Cisco page that shows the layers of the OSI model and how it works: \url{https://learningnetwork.cisco.com/docs/DOC-30382}
Check out this page: \url{https://www.studytonight.com/computer-networks/complete-osi-model}
The Wikipedia page is also very good: \url{https://en.wikipedia.org/wiki/OSI_model}
\paragraph{Which protocol lives on which layer?} Check out this network protocol map. Unfortunately it is really hard to find these days:
\url{https://www.blackmagicboxes.com/wp-content/uploads/2016/12/Network-Protocols-Map-Poster.jpg}
\section{IP Subnetting}
Check out this IP address and subnet guide from Cisco: \url{https://www.cisco.com/c/en/us/support/docs/ip/routing-information-protocol-rip/13788-3.html}
A calculator for Subnets: \url{https://www.calculator.net/ip-subnet-calculator.html}
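As a quick worked example of what such a calculator does: an address range like 192.168.1.0/26 uses 26 bits for the network part, which leaves $32 - 26 = 6$ bits for hosts. That gives $2^{6} = 64$ addresses in the subnet, of which 62 are usable for hosts, because one address is the network address and one is the broadcast address. The corresponding subnet mask is 255.255.255.192.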
\section{Switch, Layer 3 Switch}
\section{Router}
\section{Firewalls}
I talked about Network Infrastructure and Techniques in this podcast: \url{https://anchor.fm/andreaskayy/embed/episodes/IT-Networking-Infrastructure-and-Linux-031-PoDS-e242bh}
\chapter{Security and Privacy}
\section{SSL Public \& Private Key Certificates}
\section{What is a certificate authority}
\section{JSON Web Tokens}
Link to the Wiki page: \url{https://en.wikipedia.org/wiki/JSON_Web_Token}
\section{GDPR regulations}
\section{Privacy by design}
\chapter{Linux}
Linux is very important to learn, at least the basics. Most big data tools and NoSQL databases run on Linux.
From time to time you need to modify stuff through the operating system, especially if you run an infrastructure as a service solution like Cloudera CDH, Hortonworks or a MapR Hadoop distribution.
\section{OS Basics}
Show the shell command history, filtered for Docker commands:
\begin{lstlisting}
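# show the full shell history and keep only the lines that mention docker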
history | grep docker
\end{lstlisting}
\section{Shell scripting}
\section{Cron jobs}
Cron jobs are super important to automate simple processes or jobs in Linux. You will need this here and there, I promise.
Check out these three guides:
\url{https://linuxconfig.org/linux-crontab-reference-guide}
\url{https://www.ostechnix.com/a-beginners-guide-to-cron-jobs/}
And of course Wikipedia, which is surprisingly good: \url{https://en.wikipedia.org/wiki/Cron}
Pro tip: Don't forget to end your cron files with an empty line or a comment, otherwise it will not work.
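As a minimal sketch of what such a cron job looks like (the script path below is made up for illustration), you edit your crontab with crontab -e and add a line like this:
\begin{lstlisting}
# minute hour day-of-month month day-of-week  command
# run a hypothetical cleanup script every day at 03:30
30 3 * * * /home/andreas/scripts/cleanup.sh >> /var/log/cleanup.log 2>&1
\end{lstlisting}
The five fields are minute, hour, day of month, month and day of week, so this entry runs every day at 03:30; the redirection at the end keeps a simple log of every run.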
\section{Package management}
Linux Tips are the second part of this podcast: \url{https://anchor.fm/andreaskayy/embed/episodes/IT-Networking-Infrastructure-and-Linux-031-PoDS-e242bh}
\chapter{The Cloud}
\section{IaaS vs PaaS vs SaaS}
Check out this podcast; it will help you understand what the differences are and how to decide what you are going to use.
\begin{table}[h]
\begin{tabular}{ll}
\hline
\multicolumn{2}{l}{\textbf{Podcast Episode:} \#082 Reading Tweets With Apache Nifi \& IaaS vs PaaS vs SaaS} \\ \hline
\multicolumn{2}{p{15cm}}{In this episode we are talking about the differences between infrastructure as a service, platform as a service and software as a service. Then we install the Nifi docker container and look into how we can extract the twitter data.} \\ \hline
\multicolumn{1}{l|}{Youtube} & \href{https://youtu.be/pWuT4UAocUY}{Click here to watch} \\
\multicolumn{1}{l|}{Audio} & \href{https://anchor.fm/andreaskayy/episodes/082-Reading-Tweets-With-Apache-Nifi--IaaS-vs-PaaS-vs-SaaS-e45j50}{Click here to listen} \\ \hline
\end{tabular}
\captionof{table}{Podcast: 082 Reading Tweets With Apache Nifi \& IaaS vs PaaS vs SaaS} %\label{tbl:spotifycasestudy}
\end{table}
\section{AWS, Azure, IBM, Google Cloud basics}
\section{Cloud vs On-Premises}
\begin{table}[h]
\begin{tabular}{ll}
\hline
\multicolumn{2}{l}{\textbf{Podcast Episode:} \#076 Cloud vs On-Premise} \\ \hline
\multicolumn{2}{p{15cm}}{How do you choose between cloud and on-premises? The pros and cons, and what you have to think about, because there are good reasons not to go cloud.
Also some thoughts on how to choose between the cloud providers by just comparing instance prices; otherwise the comparison will drive you insane.
My suggestion: Basically use them as IaaS and something like Cloudera as PaaS. Then build your solution on top of that. } \\ \hline
\multicolumn{1}{l|}{YouTube} & \href{https://youtu.be/BAzj0yGcrnE}{Click here to watch} \\
\multicolumn{1}{l|}{Audio} & \href{https://anchor.fm/andreaskayy/episodes/076-Cloud-vs-On-Premise-How-To-Decide-e45ivk}{Click here to listen} \\ \hline
\end{tabular}
\captionof{table}{Podcast: 076 Cloud vs On-Premise} %\label{tbl:spotifycasestudy}
\end{table}
\section{Security}
Listen to a few thoughts about the cloud in this podcast:
\url{https://anchor.fm/andreaskayy/embed/episodes/Dont-Be-Arrogant-The-Cloud-is-Safer-Then-Your-On-Premise-e16k9s}
\section{Hybrid Clouds}
Hybrid clouds are a mixture of on-premises and cloud deployment. A very interesting example for this is Google Anthos:
\url{https://cloud.google.com/anthos/}
\chapter{Security Zone Design}
\section{How to secure a multi layered application}
(UI in a different zone than the SQL DB)
\section{Cluster security with Kerberos}
I talked about security zone design and lambda architecture in this podcast: \url{https://anchor.fm/andreaskayy/embed/episodes/How-to-Design-Security-Zones-and-Lambda-Architecture--PoDS-032-e248q2}
\section{Kerberos Tickets}
\chapter{Big Data}
\section{What is big data and where is the difference to data science and data analytics?}
I talked about the difference in this podcast: \url{https://anchor.fm/andreaskayy/embed/episodes/BI-vs-Data-Science-vs-Big-Data-e199hq}
\section{The 4Vs of Big Data | available}
A common misconception is that big data is only about data volume. Volume is only one part of the often-cited four V’s of big data: volume, velocity, variety and veracity.
\textbf{Volume} is about the size. How much data you have.
\textbf{Velocity} is about the speed of how fast the data is getting to you.
It is about how much data needs to be processed in a specific time or how much data is coming into the system. This is where the whole concept of streaming data and real-time processing comes into play.
\textbf{Variety} is the third one. It means, that the data is very different. That you have very different types of data structures.
Like CSV files, PDFs, stuff in XML, JSON logfiles, or data in some kind of key-value store.
It’s about the variety of data types from different sources that you basically want to join together, all to make an analysis based on that data.
\textbf{Veracity} is the fourth and a very, very difficult one. The issue with big data is that it is very unreliable.
You cannot really trust the data. Especially when you’re coming from the IoT, the Internet of Things side. Devices use sensors for measurement of temperature, pressure, acceleration and so on.
You cannot always be a hundred percent sure that the actual measurement is right.
When you have data from, for instance, SAP that was created by hand, you also have problems. As you know, we humans are bad at inputting stuff.
Everybody articulates things differently. We make mistakes, down to the spelling, and that can be a very difficult issue for analytics.
I talked about the 4Vs in this podcast: \url{https://anchor.fm/andreaskayy/embed/episodes/4-Vs-Of-Big-Data-Are-Enough-e1h2ra}
\section{Why Big Data? | available}
What I always emphasize is the four V’s are quite nice. They give you a general direction.
There is a much more important issue: Catastrophic Success.
What I mean by catastrophic success is that your project, your startup or your platform has more growth than you anticipated. Exponential growth is what everybody is looking for.
Because with exponential growth there is the money. It starts small and gets very big very fast. The classic hockey stick curve:
1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192,16384….BOOM!
Think about it. It starts small and quite slow, but gets very big very fast.
You get a lot of users or customers who are paying money to use your service, the platform or whatever. If you have a system that is not equipped to scale and process the data the whole system breaks down.
That’s catastrophic success. You are so successful and grow so fast that you cannot fulfill the demand anymore. And so you fail and it’s all over.
It’s not like you can just make that up as you go. Or that you can foresee a few months or weeks in advance that the current system won't work anymore.
\subsection{Planning is Everything}
It all happens very, very fast and you cannot react anymore. That's why planning and analyzing the potential of your business case is necessary.
Then you need to decide if you actually have big data or not.
You need to decide if you use big data tools. This means when you conceptualize the whole infrastructure it might look ridiculous to actually focus on big data tools.
But in the long run it will help you a lot. Good planning will get a lot of problems out of the way, especially if you think about streaming data and real-time analytics.
\subsection{The Problem With ETL}
A typical old-school platform deployment would look like the picture below. Devices use a data API to upload data that gets stored in a SQL database. An external analytics tool is querying data and uploading the results back to the SQL db. Users then use the user interface to display data stored in the database.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.7\textwidth]{images/Common-SQL-Architecture}
\caption{Common SQL Platform Architecture}
\label{fig:Bild1}
\end{figure}
Now, when the front end queries data from the SQL database the following three steps happen:
\begin{itemize}
\item The database extracts all the needed rows from the storage.
\item The extracted data gets transformed, for instance sorted by timestamp or something a lot more complex.
\item The extracted and transformed data gets loaded to the destination (the user interface) for chart creation.
\end{itemize}
With exploding amounts of stored data the ETL process starts being a real problem.
Analytics is working with large data sets, for instance whole days, weeks, months or more. Data sets are very big, like 100 GB or terabytes. That means billions or trillions of rows.
This has the result that the ETL process for large data sets takes longer and longer. Very quickly the ETL performance gets so bad it won’t deliver results to analytics anymore.
A traditional solution to overcome these performance issues is trying to increase the performance of the database server. That’s what’s called scaling up.
\subsection{Scaling Up}
To scale up the system and therefore increase ETL speeds, administrators resort to more powerful hardware by:
\begin{itemize}
\item Speeding up the extract performance by adding faster disks to physically read the data faster
\item Increasing RAM for row caching. What is already in memory does not have to be read by slow disk drives
\item Using more powerful CPUs for better transform performance (more RAM helps here as well)
\item Increasing or optimising networking performance for faster data delivery to the front end and analytics
\end{itemize}
Scaling up the system is fairly easy.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.7\textwidth]{images/SQL-Scaling-UP}
\caption{Scaling up a SQL Database}
\label{fig:Bild1}
\end{figure}
But with exponential growth it is obvious that sooner or later (more sooner than later) you will run into the same problems again. At some point you simply cannot scale up anymore because you already have a monster system, or you cannot afford to buy more expensive hardware.
The next step you could take would be scaling out.
\subsection{Scaling Out}
Scaling out is the opposite of scaling up. Instead of building bigger systems the goal is to distribute the load between many smaller systems.
The simplest way of scaling out an SQL database is using a storage area network (SAN) to store the data. You can then use up to eight SQL servers, attach them to the SAN and let them handle queries. This way load gets distributed between those eight servers.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.7\textwidth]{images/SQL-Scaling-Out}
\caption{Scaling out a SQL Database}
\label{fig:Bild1}
\end{figure}
One major downside of this setup is that, because the storage is shared between the SQL servers, it can only be used as a read-only database. Updates have to be done periodically, for instance once a day. To do updates all SQL servers have to detach from the database. Then one attaches the db in read-write mode and refreshes the data. This procedure can take a while if a lot of data needs to be uploaded.
This Link to a Microsoft MSDN page has more options of scaling out an SQL database for you.
I deliberately don’t want to get into details about possible scaling out solutions. The point I am trying to make is that while it is possible to scale out SQL databases it is very complicated.
There is no perfect solution. Every option has its up- and downsides. One common major issue is the administrative effort that you need to take to implement and maintain a scaled out solution.
\subsection{Please Don’t go Big Data}
If you don’t run into scaling issues please, do not use big data tools!
Big data is an expensive thing. A Hadoop cluster for instance needs at least five servers to work properly. More is better.
Believe me this stuff costs a lot of money.
Especially when you are taking the maintenance and development on top of big data tools into account.
If you don’t need it it’s making absolutely no sense at all!
On the other side: If you really need big data tools they will save your ass :)
\chapter {My Big Data Platform Blueprint}
Some time ago I created a simple and modular big data platform blueprint for myself. It is based on what I have seen in the field and read in tech blogs all over the internet.
Today I am going to share it with you.
Why do I believe it will be super useful to you?
Because, unlike other blueprints it is not focused on technology. It is based on four common big data platform design patterns.
Following my blueprint will allow you to create the big data platform that fits exactly your needs. Building the perfect platform will allow data scientists to discover new insights.
It will enable you to perfectly handle big data and allow you to make data driven decisions.
\paragraph{THE BLUEPRINT}
The blueprint is focused on the four key areas: Ingest, store, analyse and display.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\textwidth]{images/Big-Data-Platform-Blueprint-Title-Original.png}
\caption{Platform Blueprint}
\label{fig:Bild1}
\end{figure}
Having the platform split like this turns it into a modular platform with loosely coupled interfaces.
Why is it so important to have a modular platform?
If you have a platform that is not modular, you end up with something that is fixed or hard to modify. This means you cannot adjust the platform to the changing requirements of the company.
Because of modularity it is possible to switch out every component, if you need it.
Now, let's talk more about each key area.
\section{Ingest}
Ingestion is all about getting the data in from the source and making it available to later stages. Sources can be everything from tweets and server logs to IoT sensor data, for instance from cars.
Sources send data to your API Services. The API is going to push the data into a temporary storage.
The temporary storage allows other stages simple and fast access to incoming data.
A great solution is to use message queue systems like Apache Kafka, RabbitMQ or AWS Kinesis. Sometimes people also use caches for specialised applications like Redis.
A good practice is that the temporary storage follows the publish-subscribe pattern. This way APIs can publish messages and analytics can quickly consume them.
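To make the publish-subscribe idea more tangible, here is a hedged sketch using the console tools that ship with Apache Kafka. The topic name and broker address are examples, and the exact flags can differ slightly between Kafka versions:
\begin{lstlisting}
# create a topic for incoming sensor data (example name)
kafka-topics.sh --create --topic sensor-data \
  --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

# an API service would publish messages like this
echo '{"device":"car-42","temp":21.3}' | kafka-console-producer.sh \
  --topic sensor-data --bootstrap-server localhost:9092

# the analytics stage subscribes to the topic and consumes the messages
kafka-console-consumer.sh --topic sensor-data \
  --bootstrap-server localhost:9092 --from-beginning
\end{lstlisting}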
\section{Analyse / Process}
The analyse stage is where the actual analytics is done. Analytics, in the form of stream and batch processing.
Streaming data is taken from ingest and fed into analytics. Streaming analyses the “live” data and thus generates fast results.
As the central and most important stage, analytics also has access to the big data storage. Because of that connection, analytics can take a big chunk of data and analyse it.
This type of analysis is called batch processing. It will deliver you answers for the big questions.
To learn more about stream and batch processing read my blog post: How to Create New and Exciting Big Data Aided Products
The analytics process, batch or streaming, is not a one way process. Analytics also can write data back to the big data storage.
Often times writing data back to the storage makes sense. It allows you to combine previous analytics outputs with the raw data.
Analytics insight can give meaning to the raw data when you combine them. This combination will often times allow you to create even more useful insight.
A wide variety of analytics tools are available, ranging from MapReduce or AWS Elastic MapReduce to Apache Spark and AWS Lambda.
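As a small, hedged sketch of how a batch job in this stage is typically handed to the cluster, assuming Spark running on YARN (the class name, jar and paths are placeholders):
\begin{lstlisting}
# submit a batch analytics job to the cluster (placeholder class and jar)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.analytics.DailyReport \
  daily-report.jar hdfs:///data/raw/2019-06-01 hdfs:///data/reports/
\end{lstlisting}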
\section{Store}
This is the typical big data storage where you just store everything. It enables you to analyse the big picture.
Most of the data might seem useless for now, but it is of utmost importance to keep it. Throwing data away is a big no-no.
Why not throw something away when it is useless?
Although it seems useless for now, data scientists can work with the data. They might find new ways to analyse the data and generate valuable insight from it.
What kind of systems can be used to store big data?
Systems like Hadoop HDFS, HBase, Amazon S3 or DynamoDB are a perfect fit to store big data.
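As a hedged sketch, putting raw data into HDFS and inspecting it again comes down to a few commands (the paths and file names are made up):
\begin{lstlisting}
# create a directory in the big data storage and upload a local file
hdfs dfs -mkdir -p /data/raw/2019-06-01
hdfs dfs -put sensor-data.json /data/raw/2019-06-01/

# list and inspect what is stored
hdfs dfs -ls /data/raw/2019-06-01
hdfs dfs -cat /data/raw/2019-06-01/sensor-data.json | head
\end{lstlisting}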
Check out my podcast on how to decide between SQL and NoSQL: \url{https://anchor.fm/andreaskayy/embed/episodes/NoSQL-Vs-SQL-How-To-Choose-e12f1o}
\section{Display}
Displaying data is as important as ingesting, storing and analysing it. People need to be able to make data driven decisions.
This is why it is important to have a good visual presentation of the data. Sometimes you have a lot of different use cases or projects using the platform.
It might not be possible for you to build the perfect UI that fits everyone. What you should do in this case is enable others to build the perfect UI themselves.
How to do that? By creating APIs to access the data and making them available to developers.
Either way, UI or API, the trick is to give the display stage direct access to the data in the big data cluster. This kind of access will allow the developers to use analytics results as well as raw data to build the perfect application.
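As a hedged illustration of the API idea, a developer building their own UI could fetch pre-computed analytics results over plain HTTP. The endpoint and fields below are completely made up:
\begin{lstlisting}
# query a hypothetical display API for the latest analytics results
curl -s "http://analytics-api.example.com/v1/devices/car-42/health?last=24h" \
  -H "Accept: application/json"
\end{lstlisting}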
\chapter{Lambda Architecture}
\begin{table}[h]
\begin{tabular}{ll}
\hline
\multicolumn{2}{l}{\textbf{Podcast Episode:} \#077 Lambda Architecture and Kappa Architecture} \\ \hline
\multicolumn{2}{p{15cm}}{In this stream we talk about the lambda architecture with stream and batch processing, as well as an alternative, the Kappa architecture, that consists only of streaming. Also data engineer vs data scientist, and we discuss Andrew Ng's AI Transformation Playbook} \\ \hline
\multicolumn{1}{l|}{Audio} & \href{https://anchor.fm/andreaskayy/episodes/077-Lambda--Kappa-Architecture-e45j0r}{Click here to listen} \\
\multicolumn{1}{l|}{Youtube} & \href{https://youtu.be/iUOQPyHN9-0}{Click here to watch} \\ \hline
\end{tabular}
\captionof{table}{Podcast: 077 Lambda Architecture and Kappa Architecture} %\label{tbl:spotifycasestudy}
\end{table}
\section{Batch Processing} Ask the big questions. Remember your last yearly tax statement?
You break out the folders. You run around the house searching for the receipts.
All that fun stuff.
When you have finally found everything, you fill out the form and send it on its way.
Doing the tax statement is a prime example of a batch process.
Data comes in and gets stored, analytics loads the data from storage and creates an output (insight):
\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\textwidth]{images/Simple-Batch-Processing-Workflow}
\caption{Batch Processing Pipeline}
\label{fig:Bild1}
\end{figure}
Batch processing is something you do either without a schedule or on a schedule (tax statement). It is used to ask the big questions and gain the insights by looking at the big picture.
To do so, batch processing jobs use large amounts of data. This data is provided by storage systems like Hadoop HDFS.
They can store lots of data (petabytes) without a problem.
Results from batch jobs are very useful, but the execution time is high because the amount of data used is high.
It can take minutes or sometimes hours until you get your results.
\section{Stream Processing} Gain instant insight into your data.
Streaming allows users to make quick decisions and take actions based on “real-time” insight. Contrary to batch processing, streaming processes data on the fly, as it comes in.
With streaming you don’t have to wait minutes or hours to get results. You gain instant insight into your data.
In the batch processing pipeline, the analytics was after the data storage. It had access to all the available data.
Stream processing creates insight before the data storage. It only has access to fragments of data as they come in.
As a result, the scope of the produced insight is also limited, because the big picture is missing.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\textwidth]{images/Simple-Stream-Processing-Workflow}
\caption{Stream Processing Pipeline}
\label{fig:Bild1}
\end{figure}
Only with streaming analytics are you able to create advanced services for the customer. Netflix for instance incorporated stream processing into Chukwa V2.0 and the new Keystone pipeline.
One example of advanced services through stream processing is the Netflix “Trending Now” feature.
Check out the Netflix case study.
\section{Should you do stream or batch processing?}
It is a good idea to start with batch processing. Batch processing is the foundation of every good big data platform.
A batch processing architecture is simple, and therefore quick to set up. Platform simplicity means, it will also be relatively cheap to run.
A batch processing platform will enable you to quickly ask the big questions. They will give you invaluable insight into your data and customers.
When the time comes and you also need to do analytics on the fly, then add a streaming pipeline to your batch processing big data platform.
\section{Lambda Architecture Alternative }
\subsection{Kappa Architecture}
\subsection{Kappa Architecture with Kudu}
\section{Why a Good Data Platform Is Important}
\begin{table}[h]
\begin{tabular}{ll}
\hline
\multicolumn{2}{l}{\textbf{Podcast Episode:} \#066 How To Do Data Science From A Data Engineers Perspective} \\ \hline
\multicolumn{2}{p{15cm}}{A simple introduction how to do data science in the context of the internet of things.} \\ \hline
\multicolumn{1}{l|}{YouTube} & \href{https://youtu.be/yp_cc4R0mGQ}{Click here to watch} \\
\multicolumn{1}{l|}{Audio} & \href{https://anchor.fm/andreaskayy/episodes/066-How-To-Do-Data-Science-From-A-Data-Engineers-Perspective-e45imt}{Click here to listen} \\ \hline
\end{tabular}
\captionof{table}{Podcast: 066 How To Do Data Science From A Data Engineers..} %\label{tbl:spotifycasestudy}
\end{table}
\chapter{Data Warehouse vs Data Lake}
\begin{table}[h]
\begin{tabular}{ll}
\hline
\multicolumn{2}{l}{\textbf{Podcast Episode:} \#055 Data Warehouse vs Data Lake} \\ \hline
\multicolumn{2}{p{15cm}}{On this podcast we are going to talk about data warehouses and data lakes.
When do people use which?
What are the pros and cons of both?
Architecture examples for both.
Does it make sense to completely move to a data lake?} \\ \hline
\multicolumn{1}{l|}{YouTube} & \href{https://youtu.be/8gNQTrUUwMk}{Click here to watch} \\
\multicolumn{1}{l|}{Audio} & \href{https://anchor.fm/andreaskayy/episodes/055-Data-Warehouse-vs-Data-Lake-e45iem}{Click here to listen} \\ \hline
\end{tabular}
\captionof{table}{Podcast: 055 Data Warehouse vs Data Lake} %\label{tbl:spotifycasestudy}
\end{table}
\chapter{Hadoop Platforms | available}
When people talk about big data, one of the first things that comes to mind is Hadoop. A Google search for Hadoop returns about 28 million results.
It seems like you need Hadoop to do big data. Today I am going to shed light onto why Hadoop is so trendy.
You will see that Hadoop has evolved from a platform into an ecosystem. Its design allows a lot of Apache projects and 3rd party tools to benefit from Hadoop.
I will conclude with my opinion on whether you need to learn Hadoop and whether Hadoop is the right technology for everybody.
\section{What is Hadoop}
Hadoop is a platform for distributed storage and analysis of very large data sets.
Hadoop has four main modules: Hadoop common, HDFS, MapReduce and YARN. The way these modules are woven together is what makes Hadoop so successful.
The Hadoop common libraries and functions are working in the background. That’s why I will not go further into them. They are mainly there to support Hadoop’s modules.
\begin{table}[h]
\begin{tabular}{ll}
\hline
\multicolumn{2}{l}{\textbf{Podcast Episode:} \#060 What Is Hadoop And Is Hadoop Still Relevant In 2019?} \\ \hline
\multicolumn{2}{p{15cm}}{An introduction to Hadoop HDFS, YARN and MapReduce. Yes, Hadoop is still relevant in 2019 even if you look into serverless tools. } \\ \hline
\multicolumn{1}{l|}{YouTube} & \href{https://youtu.be/8AWaht3YQgo}{Click here to watch} \\
\multicolumn{1}{l|}{Audio} & \href{https://anchor.fm/andreaskayy/episodes/060-What-Is-Hadoop-And-Is-Hadoop-Still-Relevant-In-2019-e45ijp}{Click here to listen} \\ \hline
\end{tabular}
\captionof{table}{Podcast: 060 What Is Hadoop And Is Hadoop Still Relevant In 2019?} %\label{tbl:spotifycasestudy}
\end{table}
\section{What makes Hadoop so popular? | available}
Storing and analyzing data as large as you want is nice. But what makes Hadoop so popular?
Hadoop’s core functionality is the driver of Hadoop’s adoption. Many Apache side projects use its core functions.
Because of all those side projects Hadoop has turned more into an ecosystem. An ecosystem for storing and processing big data.
To better visualize this ecosystem I have drawn you the following graphic. It shows some projects of the Hadoop ecosystem that are closely connected with Hadoop.
It is not a complete list. There are many more tools that even I don’t know. Maybe I will draw a complete map in the future.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\textwidth]{images/Hadoop-Ecosystem.png}
\caption{Hadoop Ecosystem Components}
\label{fig:Bild1}
\end{figure}
\section{Hadoop Ecosystem Components}
Remember my big data platform blueprint? The blueprint has four stages: Ingest, store, analyse and display.
Because of the Hadoop ecosystem, the different tools in these stages can work together perfectly.
Here’s an example:
\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\textwidth]{images/Hadoop-Ecosystem-Connections.png}
\caption{Connections between tools}
\label{fig:Bild1}
\end{figure}
You use Apache Kafka to ingest data and store it in HDFS. You do the analytics with Apache Spark and, as a backend for the display, you store data in Apache HBase.
To have a working system you also need YARN for resource management. You also need ZooKeeper, a configuration management service, to use Kafka and HBase.
As you can see in the picture, each project is closely connected to the others.
Spark for instance, can directly access Kafka to consume messages. It is able to access HDFS for storing or processing stored data.
It also can write into HBase to push analytics results to the front end.
The cool thing about such an ecosystem is that it is easy to build in new functions.
Want to store data from Kafka directly into HDFS without using Spark?
No problem, there is a project for that. Apache Flume has interfaces for Kafka and HDFS.
It can act as an agent to consume messages from Kafka and store them into HDFS. You do not even have to worry about Flume resource management.
Flume can use Hadoop’s YARN resource manager out of the box.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\textwidth]{images/Hadoop-Ecosystem-Connections-Flume.png}
\caption{Flume Integration}
\label{fig:Bild1}
\end{figure}
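As a rough sketch, running such a Flume agent comes down to pointing the flume-ng launcher at a configuration file that defines the Kafka source, a channel and the HDFS sink. The agent name and file name below are placeholders:
\begin{lstlisting}
# start a Flume agent that reads from Kafka and writes to HDFS
# (kafka-to-hdfs.conf is a hypothetical config defining source, channel and sink)
flume-ng agent \
  --name a1 \
  --conf ./conf \
  --conf-file kafka-to-hdfs.conf
\end{lstlisting}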
\section{Hadoop Is Everywhere?}
Although Hadoop is so popular, it is not a silver bullet. It isn’t the tool that you should use for everything.
Often times it does not make sense to deploy a Hadoop cluster, because it can be overkill. Hadoop does not run on a single server.
You basically need at least five servers, better six, to run a small cluster. Because of that, the initial platform costs are quite high.
One option you have is to use specialized systems like Cassandra, MongoDB or other NoSQL DBs for storage. Or you move to Amazon and use Amazon’s Simple Storage Service, or S3.
Guess what the tech behind S3 is. Yes, HDFS. That’s why AWS also has the equivalent to MapReduce named Elastic MapReduce.
The great thing about S3 is that you can start very small. When your system grows you don’t have to worry about S3’s server scaling.
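If you go the S3 route, starting small really is just a few commands with the AWS CLI. Here is a hedged sketch with a placeholder bucket name:
\begin{lstlisting}
# create a bucket and upload raw data to S3 (placeholder bucket name)
aws s3 mb s3://my-data-platform-raw
aws s3 cp sensor-data.json s3://my-data-platform-raw/2019-06-01/
aws s3 ls s3://my-data-platform-raw/2019-06-01/
\end{lstlisting}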
\section{Should you learn Hadoop? }
Yes, I definitely recommend you get to know how Hadoop works and how to use it. As I have shown you in this article, the ecosystem is quite large.
Many big data projects use Hadoop or can interface with it. That's why it is generally a good idea to know as many big data technologies as possible.
Not in depth, but to the point that you know how they work and how you can use them. Your main goal should be to be able to hit the ground running when you join a big data project.
Plus, most of the technologies are open source. You can try them out for free.
\subsubsection{What does a Hadoop system architecture look like?}
\subsubsection{What tools are usually used with a Hadoop cluster?}
\begin{itemize}
\item YARN
\item ZooKeeper
\item HDFS
\item Oozie
\item Flume
\item Hive
\end{itemize}
\section{How to select Hadoop Cluster Hardware}