\chapter{The ``negative-outside'' rule}
\sloppy
The research presented in this chapter has been published as Baker \textit{et al.}, 2017, titled `Charged residues next to transmembrane regions revisited: ``Positive\--inside rule'' is complemented by the ``negative inside depletion/outside enrichment rule''' by James Alexander Baker, Wing\--Cheong Wong, Birgit Eisenhaber, Jim Warwicker, and Frank Eisenhaber~\cite{Baker2017}.
Here we include the supplementary tables and figures in the text.
\section{Summary}
\subsection{Background}
Transmembrane helices frequently occur amongst protein architectures as a means for proteins to attach to or embed into biological membranes.
Physical constraints such as the membrane’s hydrophobicity and electrostatic potential apply uniform requirements to transmembrane helices and their flanking regions; consequently, they are mirrored in their sequence patterns (in addition to transmembrane helices being a span of generally hydrophobic residues) on top of variations enforced by the specific protein’s biological functions.
\subsection{Results}
With statistics derived from a large body of protein sequences, we demonstrate that, in addition to the positive charge preference at the cytoplasmic inside (positive\--inside rule), negatively\--charged residues preferentially occur or are even enriched at the non\--cytoplasmic flank or, at least, they are suppressed at the cytoplasmic flank (negative\--not\--inside/negative\--outside rule).
As negative residues are generally rare within or near transmembrane helices, the statistical significance is sensitive with regard to details of transmembrane helix alignment and residue frequency normalisation and also to dataset size; therefore, this trend was obscured in previous work.
We observe variations amongst taxa as well as for key organelles along the secretory pathway.
The effect is most pronounced for transmembrane helices from single\--pass transmembrane (bitopic) proteins compared to those with multiple transmembrane helices (polytopic proteins) and especially for the class of simple transmembrane helices that evolved for the sole role as membrane anchors.
\subsection{Conclusions}
The charged-residue flank bias is only one of the transmembrane helix sequence features with a role in the anchorage mechanisms; others apparently include the leucine intra-helix propensity skew towards the cytoplasmic side, tryptophan flanking, and the cysteine and tyrosine inside preference.
These observations will stimulate new prediction methods for transmembrane helices and protein topology from a sequence as well as new engineering designs for artificial membrane proteins.
\section{Introduction}
Two decades ago, the classic concept of a~\gls{tmh} was a rather simple story: typical~\gls{tmp}s were thought to be anchored in the membrane by membrane-spanning bundles of non-polar \(\alpha\)--helices roughly 20 residues in length, consistently oriented perpendicular to the membrane surface.
Although this is broadly true, hundreds of high-quality membrane protein structures have shown that membrane-embedded helices can adopt a plethora of lengths and orientations within the membrane.
They can span the membrane only partially, span it at oblique angles, or even lie flat on the membrane surface~\cite{Elofsson2007, VonHeijne2006}.
The insertion and formation of the~\gls{tmh}s follow a complex thermodynamic equilibrium~\cite{Moon2013, MacCallum2011, Cymer2015}.
From the biological function point of view, many~\gls{tmh}s have multiple roles besides being just hydrophobic anchors; for example, certain~\gls{tmh}s have been identified as regulators of protein quality control and trafficking mechanisms~\cite{Hessa2011}.
As these additional biological functions are mirrored in the~\gls{tmh}s’ sequence patterns,~\gls{tmh}s can be classified as simple (just hydrophobic anchors) and complex sequence segments~\cite{Wong2010, Wong2011, Wong2012}.
The relationship between sequence patterns in and in the vicinity of~\gls{tmh}s and their structural and functional properties, as well as their interaction with the lipid bilayer membrane, has been a field of intensive research in the last three decades~\cite{Ladokhin2015}.
Besides the span of generally hydrophobic residues in the~\gls{tmh}, there are other trends in the sequence, such as a saddle-like distribution of polar residues (depressed incidence of charged residues in the~\gls{tmh} itself), an enriched occurrence of positively\--charged residues in the cytosolic flanking regions, and an increased likelihood of tryptophan and tyrosine at either flank edge~\cite{Sharpe2010, VonHeijne1986,VonHeijne1988,VonHeijne1989, Baeza-Delgado2013, Granseth2005}.
Such properties vary somewhat in length and intensity between various biological organelle membranes, between prokaryotes and eukaryotes~\cite{Ojemalm2013} and even among eukaryotic species studied due to slightly different membrane constraints~\cite{Sharpe2010, Pogozheva2013}.
These biological dispositions are exploitable in terms of~\gls{tm} region prediction in query protein sequences~\cite{Beuming2004, Zhao2006} and tools such as the quite reliable TMHMM~\cite{Krogh2001,Sonnhammer1998}, Phobius~\cite{Kall2004, Kall2007} or DAS-TMfilter represent today’s prediction limit of~\gls{tmh}s’ hydrophobic cores within the protein sequence~\cite{Cserzo2002, Cserzo2004, Kall2002}.
The prediction accuracy for true positives and negatives is reported to be close to 100\%, and the main remaining cause of false-positive predictions is hydrophobic \(\alpha\)--helices completely buried in the hydrophobic core of proteins.
Notably, reliable prediction of~\gls{tmh}s and protein topology places a strong restriction on the possible function of even otherwise non\--characterised proteins~\cite{Eisenhaber2016, Eisenhaber2012, Sherman2015} and is thus very valuable information.
The ``positive\--inside rule'' reported by von Heijne~\cite{VonHeijne2006, VonHeijne1989} postulates the preferential occurrence of positively\--charged residues (lysine and arginine) at the cytoplasmic edge of~\gls{tmh}s.
The practical value of positively\--charged residue sequence clustering in topology prediction of~\gls{tmh} was first shown for the plasmalemma in bacteria~\cite{VonHeijne1989, Sipos1993}.
As a trend, the ``positive-inside rule'' has since been confirmed with statistical observations for most membrane proteins and biological membrane types~\cite{Baeza-Delgado2013, Gavel1991, Nilsson2005a, Wallin1998}.
However, more recent evidence suggests that, in thylakoid membranes, the ``positive-inside rule'' is less applicable due to the co-occurrence of aspartic acid and glutamic acid residues together with positively\--charged residues~\cite{Pogozheva2013}.
The positive-inside rule also received support from protein engineering experiments that revealed conclusive evidence for positive charges as a topological determinant~\cite{VonHeijne1989, Beltzer1991, Kida2006, Nilsson1990}.
Mutational experiments demonstrated that charged residues, when inserted into the centre of the helix, had a large effect on insertion capabilities of the~\gls{tmh} via the translocon.
Insertion becomes more unfavourable when the charge was placed closer to the~\gls{tmh} core~\cite{Hessa2005}.
It remains unclear exactly why and how the positive charge determines topology from a biophysical perspective.
Positively\--charged residues are suggested to be stronger determinants of topology than negatively\--charged residues due to a dampening of the translocation potential of negatively\--charged residues.
This dampening is the result of protein-lipid interactions with the net-neutral phospholipid phosphatidylethanolamine and other neutral lipids.
This effect favours cytoplasmic retention of positively\--charged residues~\cite{Bogdanov2014}.
The recent accumulation of~\gls{tmp} sequences and structures allowed revisiting the problem of charged residue distribution in~\gls{tmh}s (see also \url{http://blanco.biomol.uci.edu/mpstruc/}).
For example, whilst \(\beta\)--sheets contain charged residues in the~\gls{tm} region, $\alpha$\--helices generally do not~\cite{Ulmschneider2001}.
Large-scale sequence analysis of~\gls{tmh}s from various organelle membrane surfaces in eukaryotic proteomes confirms that positive charge clusters with a statistical bias towards the cytosolic side of the membrane.
At the same time, there are many examples of~\gls{tmh}s that are exceptions to the positive-inside rule; however, as a trend, topology can be determined by simply looking for the most positive loop region between helices~\cite{Sharpe2010, Baeza-Delgado2013}.
When the observation of positively\--charged residues preferentially localised at the cytoplasmic edge of~\gls{tmh}s emerged, it was also asked whether negatively\--charged residues work in concert with~\gls{tmh} orientation.
It was shown that a single additional lysine residue can reverse the topology of a model \textit{Escherichia coli} protein, whereas a much higher number of negatively\--charged residues is needed to achieve the same~\cite{Nilsson1990}; nevertheless, a sufficiently large negative charge can overturn the positive-inside rule~\cite{Andersson1993, Kim1994} and, thus indeed, negative residues are topologically active to a point.
Negatively\--charged residues were observed in the flanks of~\gls{tmh}s~\cite{Baeza-Delgado2013}, especially of marginally hydrophobic~\gls{tm} regions~\cite{Delgado-Partin1998}.
It is known that the negatively\--charged acidic residues in~\gls{tm} regions have a non-trivial role in the biological context.
In \textit{E.~coli}, negative residues experience electrical pulling forces when travelling through the SecYEG translocon, indicating that negative charges are biologically relevant during the electrostatic interactions of insertion~\cite{Ismail2012, Ismail2015}.
Unfortunately, there is a problem with statistical evidence for preferential negative charge occurrence next to~\gls{tmh} regions.
Early investigations indicated that, overall, both positive and negative charges were influential topology factors, an idea dubbed the charge balance rule.
If true, one would also expect to see a skew in the negative charge distribution if a cooperation between oppositely charged residues orientated a~\gls{tmh}~\cite{Sipos1993, Hartmann1989}.
It might be expected that, if positive residues force the loop or tail to stay inside, negative residues would be drawn outside and topology would be determined not unlike electrophoresis.
Yet, there are plenty of individual protein examples but no conclusive statistical evidence in the current literature for a negatively\--charged skew~\cite{Sharpe2010, Baeza-Delgado2013, Granseth2005, Pogozheva2013, Nilsson2005a, Andersson1992}.
There are many observations described in the literature that charged residues determine topology more predictably in single\--pass proteins than in multi\--pass proteins~\cite{Kim1994, Harley1998}.
It is thought that the charges only determine the initial orientation of the~\gls{tmh} in the biological membrane; yet, the ultimate orientation must be determined together with the totality of subsequent downstream regions~\cite{Sato1998}.
With sequence-based hydrophobicity and volume analysis and consensus sequence studies, Sharpe \textit{et al.}~\cite{Sharpe2010} demonstrated that there is asymmetry in the intramembranous space of some membranes.
Crucially, this asymmetry differs among the membranes of various organelles.
They conclude that there are general differences between the lipid composition and organisation in membranes of the Golgi and~\gls{er}.
Functional aspects are also important.
For example, the abundance of serines in the region following the lumenal end of Golgi~\gls{tmh}s appears to reflect the fact that this part of many Golgi enzymes forms a flexible linker that tethers the catalytic domain to the membrane~\cite{Sharpe2010}.
A study by Baeza-Delgado \textit{et al.}~\cite{Baeza-Delgado2013} analysed the distribution of amino acid residue types in~\gls{tmh}s in 170 integral membrane proteins from a manually maintained database of experimentally confirmed~\gls{tmp}s (MPTopo~\cite{Jayasinghe2001}) as well as in 930 structures from the~\gls{pdb}.
As expected, half of the natural amino acids are equally distributed along~\gls{tmh}s, whereas aromatic, polar and charged amino acids, along with proline, are biased towards the flanks of the~\gls{tmh}s.
Unsurprisingly, leucine and other non-polar residues are far more abundant than the charged residues in the~\gls{tm} region~\cite{Sharpe2010, Baeza-Delgado2013}.
In this work, we revisit the issue of statistical evidence for the preferential distribution of negatively\--charged (and a few other) residues within and nearby~\gls{tmh}s.
We rely on the improved availability of comprehensive and large sequence and structure datasets for~\gls{tm} proteins.
We also show that several methodical aspects have prevented previous studies~\cite{Sharpe2010, Baeza-Delgado2013, Pogozheva2013} from seeing the consistent non-trivial skew for negatively\--charged residues disfavouring the cytosolic interfacial region and/or preferring the outside flank.
First, we show that acidic residues are especially rare within and in the close sequence environment of~\gls{tmh}s, even when compared to positively\--charged lysine and arginine.
Second, the manner of normalisation is therefore critical: taken together with the difficulty of properly aligning~\gls{tmh}s relative to their boundaries, column-wise frequency calculations relative to all amino acid types, as in previous studies, will blur possible preferential localisations of negative charges in the sequence.
However, the outcome changes when we ask where a negative charge occurs in the sequence relative to the total amount of negative charges in the respective sequence region.
Thus, by accounting for the rarity of acidic residues with sensitive normalisation, the ``non-negative inside rule/negative-outside rule'' is clearly supported by the statistical data.
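As an illustrative sketch of the two normalisations contrasted above, let $n_{a,i}$ denote the number of residues of amino acid type $a$ observed at aligned position $i$; this notation is an assumption introduced here for illustration only and is not taken from the analysis code.
The column-wise frequency $f_{a,i}$ and the relative percentage $r_{a,i}$ can then be written as
\[
f_{a,i} = \frac{n_{a,i}}{\sum_{b} n_{b,i}}, \qquad
r_{a,i} = 100 \times \frac{n_{a,i}}{\sum_{j} n_{a,j}},
\]
where $b$ runs over all amino acid types and $j$ runs over all positions of the sequence region considered (\gls{tmh} plus flanks).
The denominator of $f_{a,i}$ is dominated by abundant residues such as leucine, whereas $r_{a,i}$ depends only on how the residues of type $a$ themselves are distributed among the positions.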
We find that minor changes in the flank definitions, such as taking the~\gls{tmh} boundaries from the database or generating flanks by centrally aligning~\gls{tmh}s and applying some standardised~\gls{tmh} length, do not have a noticeable influence on the charge bias detected.
Third, there are significant differences in the distribution of amino acid residues between single\--pass and multi\--pass~\gls{tm} regions in both the intra-membrane helix and the flanking regions with further variations introduced by taxa and by the organelles along the secretory pathway.
Importantly, we find that it is critical to down-weight the effect of~\gls{tmh}s in multi\--pass~\gls{tmp}s with no or very short flanks in order to observe statistical significance for the charge bias.
To put it bluntly, if there are no flanks of sufficient length, there is also no negative\--charge bias to be observed.
The charge bias effect is even clearer when a classification of~\gls{tmh}s into so-called simple (which, as a trend, are mostly single\--pass and mere anchors) and so-called complex (which typically have functions beyond anchorage) is considered~\cite{Wong2010, Wong2011, Wong2012}.
We also observe parallel skews with regard to leucine, tyrosine, tryptophan and cysteine distributions.
With these large-scale datasets and a sensitive normalisation approach, new sequence features are revealed that provide spatial insight into~\gls{tmh} membrane anchoring, recognition, helix-lipid, and helix-helix interactions.
\section{Results}
\subsection{Acidic residues within and nearby transmembrane helix segments are rare}
In order to reliably compare the amino acid sequence properties of~\gls{tmh}s, we assembled datasets of~\gls{tmh} proteins from representative eukaryotic and prokaryotic genomes that are likely to be the best in terms of quality and comprehensiveness of annotation, as well as composite datasets representing larger taxonomic groups and sub-cellular locations (see Table~\ref{table:acidicresiduesarerare}).
In total, 3292 single\--pass~\gls{tmh} segments and 29898 multi\--pass~\gls{tmh} segments were extracted from various UniProt~\cite{TheUniProtConsortium2014} text files according to TRANSMEM annotation (download dated 20--03--2016).
The UniProt datasets used only included manually curated records; however, it is still necessary to check for systematic bias due to the prediction methods used by UniProt for~\gls{tmh} annotation in the majority of cases without direct experimental evidence.
Therefore, a fully experimentally verified dataset was also generated for comparison.
The representative 1544 single\--pass and 15563 multi\--pass~\gls{tmh}s were extracted from the manually curated, experimentally verified TOPDB~\cite{Dobson2015} database (download dated 21--03--2016), referred to as ExpAll here (Table~\ref{table:acidicresiduesarerare}).
\gls{tmh} organelle residency is defined according to UniProt annotation.
To ensure reliability, organelles were only analysed from a representative redundancy-reduced protein dataset of the most well-studied genome: \textit{Homo sapiens} (referred to as UniHuman herein).
The several datasets from UniProt are subdivided into different human organelles (UniPM, UniER, UniGolgi) and taxonomical groups (UniHuman, UniCress, UniBacilli, UniEcoli, UniArch, UniFungi) as described in Table \ref{table:acidicresiduesarerare} (see also Methods section).
As will be shown below, these various datasets allow us to validate our findings for a variety of conditions, namely with regard (i) to experimental verification of~\gls{tmh}s, (ii) to origin from various species and taxonomic groups, (iii) to the number of~\gls{tmh}s in the same protein as well as (iv) to sub-cellular localisation.
Datasets and programs used in this work can be downloaded from \url{http://mendel.bii.a-star.edu.sg/SEQUENCES/NNI/}.
% Table generated by Excel2LaTeX from sheet 'Sheet1'
\begin{table}[htbp]
\centering
\captionof{table}[Acidic residues are rarer in transmembrane helices of single\--pass proteins than in transmembrane helices of multi\--pass proteins.]{\textbf{Acidic residues are rarer in transmembrane helices of single\--pass proteins than in transmembrane helices of multi\--pass proteins.} Statistical results from comparing the number of acidic residues in single\--pass and multi\--pass~\gls{tmh}s within their database-defined limits, excluding any flanks.
The number of helices per dataset can be found in Table~\ref{table:negativeskewsinglepass} for single\--pass~\gls{tmh}s and Table~\ref{table:multipassstats} for multi\--pass helices.
$\mu$ SP is the average number of the respective residues per helix in~\gls{tmh}s from single\--pass proteins, while $\mu$ MP is the average number of the respective residues per~\gls{tmh} from multi\--pass proteins.
The Kruskal-Wallis test scores (H statistics) were calculated by comparing the numbers of aspartic acid and glutamic acid residues in each helix from single\--pass~\gls{tmh}s with those in each helix from multi\--pass~\gls{tmh}s.}
\resizebox{\textwidth}{!}{
\begin{tabular}{p{5em}rrp{5em}rrp{5em}rrp{5em}}
\toprule
\footnotesize
\multirow{2}[4]{*}{\textbf{Data-set}} & \multicolumn{3}{p{15em}}{\textbf{Acidic residues (D and E)}} & \multicolumn{3}{p{15em}}{\textbf{Aspartic acid (D only)}} & \multicolumn{3}{p{15em}}{\textbf{Glutamic acid (E only)}} \\
\cmidrule{2-10} \multicolumn{1}{l}{} & \multicolumn{1}{p{5em}}{\textbf{$\mu$ SP}} & \multicolumn{1}{p{5em}}{\textbf{$\mu$ MP}} & \specialcell{\textbf{H statistic}\\ \textbf{p\--value}} & \multicolumn{1}{p{5em}}{\textbf{$\mu$ SP}} & \multicolumn{1}{p{5em}}{\textbf{$\mu$ MP}} & \specialcell{\textbf{H statistic}\\ \textbf{p\--value}} & \multicolumn{1}{p{5em}}{\textbf{$\mu$ SP}} & \multicolumn{1}{p{5em}}{\textbf{$\mu$ MP}} & \specialcell{\textbf{H statistic}\\ \textbf{p\--value}} \\
\midrule
ExpAll & 0.086 & 0.309 & \specialcell{148.1 \\ 4.50E-34} & 0.045 & 0.157 & \specialcell{40.3 \\ 2.13E-10} & 0.042 & 0.161 & \specialcell{46.6\\ 8.64E-12} \\
\midrule
UniHuman & 0.076 & 0.398 & \specialcell{316.5 \\ 8.31E-71} & 0.034 & 0.191 & \specialcell{91.6 \\ 1.05E-21} & 0.042 & 0.207 & \specialcell{100.3 \\ 1.33E-23} \\
\midrule
UniER & 0.106 & 0.43 & \specialcell{34.4 \\ 4.39E-9} & 0.061 & 0.161 & \specialcell{8.0 \\ 4.72E-3} & 0.045 & 0.268 & \specialcell{26.8 \\ 2.24E-7} \\
\midrule
UniGolgi & 0.097 & 0.381 & \specialcell{39.8 \\ 2.88E-10} & 0.043 & 0.18 & \specialcell{19.4 \\ 1.05E-5} & 0.053 & 0.201 & \specialcell{20.2 \\ 7.01E-6} \\
\midrule
UniPM & 0.039 & 0.4 & \specialcell{121.0 \\ 3.86E-28} & 0.016 & 0.187 & \specialcell{32.7 \\ 1.06E-8} & 0.022 & 0.213 & \specialcell{36.9 \\ 1.26E-9} \\
\midrule
UniCress & 0.062 & 0.434 & \specialcell{163.5 \\ 1.99E-37} & 0.036 & 0.198 & \specialcell{32.5 \\ 1.20E-8} & 0.025 & 0.241 & \specialcell{66.0 \\ 4.59E-16} \\
\midrule
UniFungi & 0.177 & 0.349 & \specialcell{43.1 \\ 5.14E-11} & 0.044 & 0.166 & \specialcell{24.5 \\ 7.60E-7} & 0.133 & 0.183 & \specialcell{4.6 \\ 0.033 }\\
\midrule
UniBacilli & 0.089 & 0.352 & \specialcell{24.1 \\ 9.16E-7} & 0.048 & 0.185 & \specialcell{11.2 \\ 8.27E-4} & 0.04 & 0.176 & \specialcell{12.3 \\ 4.54E-5} \\
\midrule
UniEcoli & 0.148 & 0.315 & \specialcell{2.7 \\ 0.100} & 0.111 & 0.15 & \specialcell{0.1 \\ 0.729 }& 0.037 & 0.163 & \specialcell{2.2 \\ 0.140 }\\
\midrule
UniArch & 0.438 & 0.606 & \specialcell{1.8 \\ 0.183} & 0.083 & 0.344 & \specialcell{11.2 \\ 8.33E-4} & 0.354 & 0.247 & \specialcell{3.5 \\ 0.0624 }\\
\bottomrule
\end{tabular}}%
\label{table:acidicresiduesarerare}
\end{table}%
The hydrophobic nature of the lipid bilayer membrane implies that, generally, charged residues should be rare within~\gls{tmh}s.
For acidic residues, even the location in the sequence vicinity of~\gls{tmh}s should be disfavoured because of the negatively\--charged head groups of lipids directed towards the aqueous extracellular side or the cytoplasm.
In agreement with the biophysically justified expectations, the statistical data confirms that acidic residues are especially rare in~\gls{tmh}s and their flanking regions.
In Figure~\ref{fig:amino_acid_distribution}, where we plot the total abundance of all amino acid types in single\--pass~\gls{tmh}s and multi\--pass~\gls{tmh}s (including their $\pm$5 flanking residues), acidic residues are amongst the rarest amino acids in both UniHuman and ExpAll.
\begin{figure}[!ht]
\centering
\includegraphics[width=1\textwidth]{NNI_chapter/amino_acid_distribution}
\captionof{figure}[Negatively\--charged amino acids are amongst the rarest residues in transmembrane helices and $\pm$5 flanking residues.] {\textbf{Negatively\--charged amino acids are amongst the rarest residues in transmembrane helices and $\pm$5 flanking residues.} Bar charts of the abundance of each amino acid type in the~\gls{tmh}s with flank lengths of the accompanying $\pm$5 residues from the (a) UniHuman single\--pass proteins, (b) ExpAll single\--pass proteins, (c) UniHuman multi\--pass proteins, and (d) ExpAll multi\--pass proteins.
Amino acid types on the horizontal axis are listed in descending count.
The bars are coloured by category (hydrophobic, neutral or hydrophilic) according to the biological free energy of insertion scale~\cite{Hessa2005}.
Grey represents hydrophilic amino acids found to have a positive $\Delta G_{\mathrm{app}}$, blue represents hydrophobic residues with a negative $\Delta G_{\mathrm{app}}$, purple denotes negatively\--charged residues, and orange denotes positively\--charged residues.
The abundances of key residues are labelled.}
\label{fig:amino_acid_distribution}
\end{figure}
The effect is most pronounced in single\--pass~\gls{tmp}s (Figure~\ref{fig:amino_acid_distribution}).
There are only 666 glutamates (just 1.24\% of all residues) and 560 aspartates (1.05\% respectively) among the total set of 53238 residues comprised in 1705~\gls{tmh}s and their flanks.
Within just the~\gls{tmh} regions, there are 71 glutamates (0.20\% of all residues in~\gls{tmh}s and flanks) and 58 aspartates (0.16\% respectively).
This cannot be an artefact of UniProt~\gls{tmh} assignments since this feature is repeated in ExpAll.
There are only 582 glutamates (1.22\%) and 520 aspartates (1.09\%) among the 47568 residues involved.
Within the~\gls{tmh} itself, there are 64 glutamates (0.19\%) and 69 aspartates (0.21\%).
In both cases, the negatively\--charged residues lie at the extreme low end of the distribution.
To note, acidic residues are rare even compared to positively\--charged residues which are about 3--4 times more frequent.
On a much smaller dataset of single-spanning~\gls{tmp}s, Nakashima \textit{et al.}~\cite{Nakashima1992} made similar compositional studies.
For comparison, they found 0.94\% glutamate and 0.94\% aspartate within just the~\gls{tmh} region (values very similar to ours from~\gls{tmh}s with small flanks; apparently, they used more outwardly defined~\gls{tmh} boundaries), but the content of each of glutamate and aspartate within the extracellular or cytoplasmic domains is larger by an order of magnitude, between 5.26\% and 9.34\%.
These latter values tend to be even higher than the average glutamate and aspartate composition throughout the protein database (5--6\%~\cite{Nakashima1992}).
In the case of multi\--pass~\gls{tmp}s (Figure~\ref{fig:amino_acid_distribution}), glutamates and aspartates are still very rare in~\gls{tmh}s and their $\pm$5 residue flanks (1.94\% and 1.92\% from the total of 377207 in the case of UniHuman respectively, 1.79\% and 1.70\% from the total of 454700 in the case of ExpAll).
Yet, their occurrence is similar to that of histidine and tryptophan and, notably, acidic residues are only about 1.5 times less frequent than positively\--charged residues.
The observation that acidic residues are more suppressed in single\--pass~\gls{tmh}s compared with the case of multi\--pass~\gls{tmh}s is statistically significant.
In Table \ref{table:acidicresiduesarerare}, the acidic residues are counted in the helices (excluding flanking regions) belonging to either multi\--pass or single\--pass helices.
Indeed, single\--pass helices appear to tolerate negative charge to a far lesser extent than multi\--pass helices as the data in the top two rows of Table \ref{table:acidicresiduesarerare} indicates (for datasets UniHuman and ExpAll).
The trend is strictly observed throughout sub-cellular localisations (rows 3--5 in Table \ref{table:acidicresiduesarerare}) and taxa (rows 6--10).
Statistical significance (P$\leq$0.001) is found in all but six cases.
These are UniEcoli (D+E, D, E), UniArch (D+E, E) and UniFungi (E).
The problem is, most likely, that the respective datasets are quite small.
Notably, the difference between single- and multi\--pass~\gls{tmh}s is greatest in UniPM\@; here,~\gls{tmh}s from multi\--pass proteins have on average 0.400 negative residues per helix, whereas single\--pass~\gls{tmh}s contain just 0.039 ($P = 3.86 \times 10^{-28}$).
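For reference, the H statistics in Table~\ref{table:acidicresiduesarerare} can be read against the standard textbook form of the Kruskal-Wallis test; the notation below is introduced here for illustration, and the usual tie-corrected form is assumed rather than taken from the exact software settings used.
With the per-helix counts from both groups pooled and ranked,
\[
H = \frac{12}{N(N+1)} \sum_{g \in \{\mathrm{SP},\,\mathrm{MP}\}} \frac{R_g^{2}}{n_g} - 3(N+1),
\]
where $n_g$ is the number of helices in group $g$, $N = n_{\mathrm{SP}} + n_{\mathrm{MP}}$, and $R_g$ is the sum of the ranks in group $g$.
Because most helices contain no acidic residues at all, ties are numerous, and $H$ is conventionally divided by $1 - \sum_{t} (t^{3} - t)/(N^{3} - N)$, where $t$ runs over the sizes of the groups of tied values.
For two groups, the p\--value follows from the $\chi^{2}$ distribution with one degree of freedom; for example, $H = 148.1$ corresponds to $P \approx 4.5 \times 10^{-34}$, consistent with the ExpAll entry for D and E combined.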
\subsection{Amino acid residue distribution analysis reveals a ``negative-not-inside/negative-outside'' signal in single\--pass transmembrane helix segments}
The rarity of negatively\--charged residues is a complicating issue when studying their distribution along the sequence positions of~\gls{tmh}s and their flanks.
For UniHuman and ExpAll, we plotted the absolute abundance of aspartic acid, glutamic acid, lysine, arginine, and leucine at each position (i.e., it scales as the equivalent fraction in the total composition of the alignment column) (Figure~\ref{fig:single_pass_charge_distribution}).
To note, the known preference of positively\--charged residues towards the cytoplasmic side is nevertheless evident.
Yet, it becomes apparent that any bias in the occurrence of the much rarer acidic residues is overshadowed by fluctuations in the highly abundant residues such as leucine.
\begin{figure}[p]
\centering
\includegraphics[width=1\textwidth]{NNI_chapter/single_pass_charge_distribution}
\captionof{figure}[Relative percentage normalisation reveals a negative-outside bias in transmembrane helices from single\--pass protein datasets.]{\textbf{Relative percentage normalisation reveals a negative-outside bias in transmembrane helices from single\--pass protein datasets.} All flank sizes were set at up to $\pm$20 residues.
We acknowledge that all values, besides the averaged values, are discrete, and connecting lines are illustrative only.
On the horizontal axes (a–d) are the distances in residues from the centre of the~\gls{tmh}, with the negative numbers extending towards the cytoplasmic space.
For (e) and (f), the horizontal axis represents the residue count from the membrane boundary with negative counts into the cytoplasmic space.
Leucine, the most abundant non-polar residue in~\gls{tmh}s, is in blue.
Arginine and lysine are shown in dark and light orange respectively.
Aspartic and glutamic acid are shown in dark and light purple respectively.
In red are the uncharged polar amino acids serine, asparagine, glutamine, threonine, tyrosine, and cysteine.
(a) and (b) On the vertical axis is the absolute abundance of residues in~\gls{tmh}s from single\--pass proteins from (a) UniHuman and (b) ExpAll.
Note that no clear trend can be seen in the negative residue distribution compared to the positive-inside signal and the leucine abundance throughout the~\gls{tmh}.
(c) and (d) On the vertical axis is the relative percentage at each position for~\gls{tmh}s from single\--pass proteins from (c) UniHuman and (d) ExpAll.
The dashed lines show the estimation of the background level of residues with respect to the colour; an average of the relative percentage values between positions 25 to 30 and –30 to –25.
The thick bars show the averages on the inner (positions –20 to –10) and outer (positions 10 to 20) flanks coloured to the respective amino acid type.
Note a visible suppression of acidic residues on the inside flank when compared to the outside flank in single\--pass proteins when normalising according to the relative percentage.
(e) and (f) The relative distribution of flanks defined by the databases with the distance from the~\gls{tmh} boundary on the horizontal axis.
The inside and outside flanks are shown in separate subplots.
The colouring is the same as in (a) and (b).}
\label{fig:single_pass_charge_distribution}
\end{figure}
The trends become clearer if the occurrence of specific residues is normalised with the total number of residues of the given amino acid type in the dataset observed in the sequence region studied as shown for UniHuman and for ExpAll in Figure~\ref{fig:single_pass_charge_distribution}.
For comparison, we indicated background residue occurrences (dashed lines calculated as averages for positions -25 to -30 and 25 to 30).
The respective average occurrences in the inside and outside flanks (calculated from an average of the values at positions -20 to -10 and 10 to 20 respectively) are shown with wide lines.
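As a sketch of this normalisation (the symbols are introduced here for illustration only), let $n_a(i)$ be the number of residues of amino acid type $a$ observed at aligned position $i$, summed over all~\gls{tmh}s of a dataset, and let $j$ run over all positions of the sequence region studied. The relative percentage at position $i$ is then
\[
r_a(i) = 100\,\% \times \frac{n_a(i)}{\sum_{j} n_a(j)} ,
\]
so that $\sum_{i} r_a(i) = 100\,\%$ for every amino acid type, regardless of its overall abundance. A residue type spread evenly over $N$ positions would show $100\,\%/N$ at each position (for instance, about 1.6\% in a hypothetical region of $N=61$ positions), which gives a sense of scale for the background and flank averages.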
The ``positive-inside rule'' becomes even more evident in this normalisation: Whereas the occurrence of positively\--charged residues is about the background level at the outside flank, it is about two to three times higher both for the UniHuman and the ExpAll datasets at the inside flank.
To note, the background level was found to be 1.7\% (lysine) and 1.6\% (arginine) in UniHuman and 1.4\% (lysine and arginine) in ExpAll.
The inside flank average is 4.3\% (lysine) and 4.6\% (arginine) in UniHuman and 4.2\% (lysine) and 4.6\% (arginine) in ExpAll.
The outside flank is similar to the background noise levels: about 1.4\% (lysine) and 1.5\% (arginine) in UniHuman and about 1.5\% (lysine) and 1.4\% (arginine) in ExpAll.
Most interestingly, a ``negative-inside depletion'' trend for the negatively\--charged residues is apparent from the distribution bias.
The inside flank averages for glutamic acid were 1.1\% and 1.4\% in UniHuman and ExpAll respectively; for aspartic acid, 1.2\% and 1.4\% in UniHuman and ExpAll respectively.
Meanwhile, the outside flanks for aspartic acid and glutamic acid occurrences were measured at 2.9\% and 2.4\% respectively in UniHuman and, in ExpAll, these values for aspartic acid and glutamic acid were found to be 2.5\% and 2.1\% respectively.
Against the background level of aspartic acid (2.8\% and 2.9\% in UniHuman) and glutamic acid (2.6\% and 2.9\% in ExpAll), the inside flank averages were found to be about 2--3 times lower than the background level while the outside flank averages were comparable to the background level (Figure~\ref{fig:single_pass_charge_distribution}).
Taken together, this indicates a clear suppression of negatively\--charged residues at the inside flank of single\--pass~\gls{tmh}s and a possible trend for negatively\--charged residues occurring preferentially at the outside flank.
This is not an effect of the flank definition selection since the trend remains the same when using the database-defined flanks without the context of the~\gls{tmh} (Figure~\ref{fig:single_pass_charge_distribution}).
For UniHuman, the negative charge expectancy on the inside flank does not reach above 2\% until position -10 (D) and position -11 (E), whereas, on the outside flank, both D and E start $>$2\%.
The same can be seen in ExpAll where negative residues reach above 2\% only as far from the membrane boundary as at position -9 (D) and position -7 (E) on the inside but exceed 2\% beginning with position 1 (D) and 3 (E) on the outside (Figure~\ref{fig:single_pass_charge_distribution}).
Residue presence is a zero\--sum variable.
If there is more likelihood of a positively\--charged residue being present at an inside position, then there must be less probability of at least one type of amino acid at that position.
To check if this probability was spread throughout non\--charged amino acids as well as negatively\--charged amino acids, we also examined non-charged polar residues for any inside versus outside preference (Figure~\ref{fig:single_pass_charge_distribution}B and Figure~\ref{fig:single_pass_charge_distribution}C).
As expected, there was an increased prevalence at the flanks (peaking at position +12 with 2.27\% in ExpAll and at position -10 with 2.39\% in UniHuman); however, there was no clear difference between the inside and outside flank relative percentages.
In ExpAll, the difference between the inside flank (1.8\% relative percentage average) and the outside flank (1.9\% relative percentage average) was 5 to 10 times smaller than the inside\--outside difference for negatively\--charged residues, and there was very little difference between the UniProt inside (1.88\% relative percentage average) and outside (1.94\% relative percentage average) relative abundances.
% multipass
% topdb inside flank = 1.9 outside =2.0
% unihuman inside =1.9 outside =2.2
The observation of negative charge suppression at the inside flank, herein the ``negative-inside depletion'' rule, is statistically significant throughout most datasets in this study.
The inside-outside bias was tested using the~\gls{kw} test, comparing the occurrence of acidic residues within the 10 flanking residues on the inside and on the outside of each~\gls{tmh} (Table~\ref{table:negativeskewsinglepass}).
We studied both the database-reported flanks as well as those obtained from central alignment of~\gls{tmh}s (see Methods).
The null hypothesis (no difference between the two flanks) could be confidently rejected in all cases (p\--value$<$0.001 except for UniBacilli), the sign of the H-statistic (\gls{kw}) indicating suppression at the inside and/or preference for the outside flank (except for UniArch).
Most importantly, acidic residues were found to be distributed with bias in ExpAll (p\--value=3.47e-58) and in UniHuman (p\--value=1.13e-93).
Whereas the problem with UniBacilli is most likely the dataset size, the exception of UniArch, for which we observe a strong ``negative-inside'' preference instead, is more puzzling and suggests biophysical differences in the archaeal plasma membrane.
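For reference, the~\gls{kw} statistic for two groups can be written as follows. This is the standard textbook form; we assume here, for illustration only, that the two groups are the per-helix counts of acidic residues in the inside and in the outside flank. With $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ observations, $N = n_{\mathrm{in}} + n_{\mathrm{out}}$, and $R_{\mathrm{in}}$ and $R_{\mathrm{out}}$ the sums of the ranks of each group in the pooled sample,
\[
H = \frac{12}{N(N+1)} \left( \frac{R_{\mathrm{in}}^{2}}{n_{\mathrm{in}}} + \frac{R_{\mathrm{out}}^{2}}{n_{\mathrm{out}}} \right) - 3(N+1) ,
\]
which is usually divided by the tie correction $1 - \sum_{t} (t^{3} - t)/(N^{3} - N)$, where $t$ runs over the sizes of the groups of tied values. Under the null hypothesis, $H$ approximately follows a $\chi^{2}$ distribution with one degree of freedom, so that
\[
p \approx \mathrm{erfc}\left( \sqrt{H/2} \right) .
\]
For example, $H = 258.59$ corresponds to $p \approx 3.5 \times 10^{-58}$, in line with the ExpAll row of Table~\ref{table:negativeskewsinglepass}.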
% Table generated by Excel2LaTeX from sheet 'Sheet1'
\begin{table}[htbp!]
\centering
\captionof{table}[Statistical significances for negative charge distribution skew on either side of the membrane in single\--pass transmembrane helices.]{\textbf{Statistical significances for negative charge distribution skew on either side of the membrane in single\--pass transmembrane helices.} The ``Helices'' column refers to the total~\gls{tmh}s contained in each dataset (ExpAll,~\gls{tmh}s from TOPDB~\cite{Dobson2015}; UniHuman, human representative proteome; UniER, human endoplasmic reticulum representative proteome; UniGolgi, human Golgi representative proteome; UniPM, human plasma membrane representative proteome; UniCress, Arabidopsis thaliana (mouse-ear cress) representative proteome; UniFungi, fungal representative proteome; UniBacilli, Bacilli class representative proteome; UniEcoli, Escherichia coli representative proteome; UniArch, Archaea representative proteome; see Methods for details).
In the ``Database-defined flanks'' column, the ``Negative residues'' column refers to the total number of negative residues found in the $\pm$10 flanking residues on either side of the~\gls{tmh} and does not include residues found in the helix itself.
In the ``Flanks after central alignment'' column, the ``Negative residues'' column refers to the total number of negative residues found in the –20 to –10 residues and the +10 to +20 residues from the centrally aligned residues of the~\gls{tmh}.
Unlike the other tables, the global averages are derived from the $\pm$20 datasets.
The~\gls{kw} scores were calculated for negative residues by comparing the number of negatively\--charged residues that were within the 10 inside residues and the 10 outside residues in either case.}
\resizebox{\textwidth}{!}{
\begin{tabular}{p{5em}lllllllll}
\toprule
\multicolumn{2}{p{10em}}{\textbf{single\--pass}} & \multicolumn{4}{p{20em}}{\textbf{Database-defined flanks}} & \multicolumn{4}{p{20em}}{\textbf{Flanks after central alignment}} \\
\midrule
\multirow{2}[4]{*}{\textbf{Data-set}} & \multicolumn{1}{c}{\multirow{2}[4]{*}{\textbf{Helices}}} & \multicolumn{2}{p{10em}}{\textbf{Negative residues}} & \multicolumn{1}{c}{\multirow{2}[4]{*}{\textbf{H statistic}}} & \multicolumn{1}{c}{\multirow{2}[4]{*}{\textbf{p\--value}}} & \multicolumn{2}{p{10em}}{\textbf{Negative residues}} & \multicolumn{1}{c}{\multirow{2}[4]{*}{\textbf{H statistic}}} & \multicolumn{1}{c}{\multirow{2}[4]{*}{\textbf{p\--value}}} \\
\cmidrule{3-4}\cmidrule{7-8} \multicolumn{1}{c}{} & & \multicolumn{1}{p{5em}}{\textbf{Inside}} & \multicolumn{1}{p{5em}}{\textbf{Outside}} & & & \multicolumn{1}{p{5em}}{\textbf{Inside}} & \multicolumn{1}{p{5em}}{\textbf{Outside}} & & \\
\midrule
ExpAll & 1544 & 848 & 1648 & 258.59 & 3.47E-58 & 735 & 1541 & 262.29 & 5.44E-59 \\
\midrule
UniHuman & 1705 & 780 & 1922 & 421.53 & 1.13E-93 & 652 & 1865 & 501.86 & 3.74E-111 \\
\midrule
UniER & 132 & 78 & 156 & 23.76 & 1.09E-06 & 76 & 150 & 21.62 & 3.33E-06 \\
\midrule
UniGolgi & 206 & 60 & 240 & 104.45 & 1.61E-24 & 54 & 239 & 107.18 & 4.06E-25 \\
\midrule
UniPM & 493 & 197 & 578 & 177.68 & 1.56E-40 & 161 & 569 & 215.18 & 1.02E-48 \\
\midrule
UniCress & 632 & 314 & 450 & 18.23 & 1.96E-05 & 231 & 444 & 55.8 & 8.01E-14 \\
\midrule
UniFungi & 729 & 449 & 631 & 28.15 & 1.12E-07 & 413 & 627 & 38.08 & 6.79E-10 \\
\midrule
UniBacilli & 124 & 90 & 113 & 3.73 & 5.35E-02 & 86 & 106 & 2.53 & 1.12E-01 \\
\midrule
UniEcoli & 54 & 32 & 77 & 17.24 & 3.30E-05 & 30 & 74 & 14.74 & 1.24E-04 \\
\midrule
UniArch & 48 & 113 & 8 & 49.66 & 1.83E-12 & 96 & 7 & 45.62 & 1.43E-11 \\
\bottomrule
\end{tabular}}%
\label{table:negativeskewsinglepass}
\end{table}%
\subsection{Amino acid residue distribution analysis reveals a general negative\--charge bias signal in the outside flank of multi\--pass transmembrane helix segments: the negative outside enrichment rule}\label{section:negativeskewmultipass}
As a result of the rarity of negatively\--charged residues, any distribution bias is difficult to recognise in the plot showing the total abundance (or alignment column composition) of residues in multi\--pass~\gls{tmh}s and their flanks from UniHuman and ExpAll (Figure~\ref{fig:multi_pass_charge_distribution}).
Yet, as with single\--pass helices, the dominant general leucine enrichment, as well as positive inside signal, can be identified with certainty.
When the residue occurrence is normalised by the total occurrence of this residue type in the sequence regions studied (shown as a relative percentage at each position for multi\--pass helices from UniHuman and ExpAll in Figure~\ref{fig:multi_pass_charge_distribution}), the bias in the distribution of any type of charged residues becomes visible.
\begin{figure}[!p]
\centering
\includegraphics[width=1\textwidth]{NNI_chapter/multi_pass_charge_distribution}
\captionof{figure}[Negative-outside bias is very subtle in transmembrane helices from multi\--pass proteins.]{\textbf{Negative-outside bias is very subtle in transmembrane helices from multi\--pass proteins.} The meaning for the horizontal axis is the same as in Figure~\ref{fig:single_pass_charge_distribution}, with the negative sequence position numbers extending towards the cytoplasmic space.
Leucine is in blue.
Arginine and lysine are shown in dark and light orange respectively.
Aspartic and glutamic acid are shown in dark and light purple respectively.
In red are the uncharged polar amino acids serine, asparagine, glutamine, threonine, tyrosine, and cysteine.
All flank sizes were set at up to $\pm$20 residues.
(a) and (b) On the vertical axes are the absolute abundances of residues from~\gls{tmh}s of multi\--pass proteins from (a) UniHuman and (b) ExpAll.
(c) and (d) On the vertical axes are the relative percentages at each position for~\gls{tmh}s from multi\--pass proteins from (c) UniHuman and (d) ExpAll.
As in Figure~\ref{fig:single_pass_charge_distribution}(c) and (d), the dashed lines show the estimation of the background level of residues with respect to the colour, and the thick bars show the averages on the inner and outer flanks coloured to the respective amino acid type.
(e) and (f) The relative distribution of flanks defined by the databases with the distance from the~\gls{tmh} boundary on the horizontal axis for both the inside and outside flanks.
The colouring is the same as in (a) and (b).}
\label{fig:multi_pass_charge_distribution}
\end{figure}
With regard to the positive-inside preference, positively\--charged residues have a background value of 2.0\% for arginine and 2.2\% for lysine in UniHuman, and 1.7\% for arginine and 1.9\% for lysine in ExpAll.
At the inside flank, this rises to 4.6\% for arginine and 4.1\% for lysine in UniHuman and 4.6\% for arginine and 4.2\% for lysine in ExpAll.
The mean net charge at each position was calculated for multi\--pass and single\--pass datasets from UniHuman and ExpAll (Figure \ref{fig:net_charge}).
The positive inside rule clearly becomes visible as the net charge has a positive skew approximately between residues -10 and -25.
What is noteworthy is that the peaks found for single\--pass helices were two to three times greater than those of multi\--pass helices.
For single\--pass~\gls{tmh}s, the peak is +0.30 at position -15 in UniHuman and +0.31 at position -14 in ExpAll, whereas~\gls{tmh}s from multi\--pass proteins had lower peaks of +0.15 at position -13 in UniHuman and +0.10 at position -14 in ExpAll.
Thus, there is a positive charge bias towards the cytoplasmic side; yet, it is much weaker for multi\--pass than for single\--pass~\gls{tmh}s.
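A minimal sketch of the plotted quantity, under the assumption (ours, for illustration; the exact definition is given in the Methods) that lysine and arginine each contribute $+1$, aspartic and glutamic acid each contribute $-1$, and all other residues contribute $0$: with $n_a(i)$ the count of residue type $a$ at position $i$ and $T$ the number of~\gls{tmh}s in the dataset,
\[
\bar{q}(i) = \frac{n_{\mathrm{K}}(i) + n_{\mathrm{R}}(i) - n_{\mathrm{D}}(i) - n_{\mathrm{E}}(i)}{T} .
\]
Under this reading, a peak of $+0.30$ means that, at that position, positively\--charged residues outnumber negatively\--charged residues by 30 per 100 helices.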
\begin{figure}[!ht]
\centering
\includegraphics[width=1\textwidth]{NNI_chapter/net_charge}
\captionof{figure}[The net charge across multi\--pass and single\--pass transmembrane helices shows a stronger positive inside charge in single\--pass transmembrane helices than multi\--pass transmembrane helices.]{\textbf{The net charge across multi\--pass and single\--pass transmembrane helices shows a stronger positive inside charge in single\--pass transmembrane helices than multi\--pass transmembrane helices.}
The net charge per~\gls{tmh} is plotted at each position; the positive-inside rule is stronger in~\gls{tmh}s from single\--pass proteins than in~\gls{tmh}s from multi\--pass proteins.
The net charge was calculated at each position as described in the Methods section for the (A) UniHuman and (B) ExpAll datasets.
Net charge for~\gls{tmh}s from multi\--pass proteins is shown in black, and the profile of~\gls{tmh}s from single\--pass proteins is drawn in blue.}
\label{fig:net_charge}
\end{figure}
Notably, a ``negative outside enrichment'' trend can also be seen from the distribution of the negatively\--charged residues, though with some effort (Table \ref{table:multipassstats}), as the effect is weaker than in the case of single\--pass~\gls{tmh}s.
We studied the flanks under four conditions: (i) database-defined flanks without overlap between neighbouring~\gls{tmh}s, (ii) flanks after central alignment of~\gls{tmh}s without flank overlap, (iii) database-defined flanks but allowing overlap of flanks shared among neighbouring~\gls{tmh}s, (iv) same as condition (i) but only the subset of cases where there is at least half of the required flank length at either side of the~\gls{tmh}.
In UniHuman as calculated under condition (i), aspartic acid is lower on the inside flank (2.3\%) than on the outside flank (3.0\%).
Glutamic acid is also lower on the inside flank (2.4\%) than on the outside flank (2.8\%) (Figure~\ref{fig:multi_pass_charge_distribution}C).
Slight variations in defining the membrane boundary point do not influence the trend (compare Figures~\ref{fig:multi_pass_charge_distribution}C and~\ref{fig:multi_pass_charge_distribution}E).
We find that, in all studied conditions, the UniHuman dataset delivers statistical significances (p\--values: (i) 6.10e-34, (ii) 5.43e-41, (iii) 3.00e-57, (iv) 5.60e-41) strongly supporting negative\--charge bias (inside suppression/outside preference; see Table~\ref{table:multipassstats}).
As with the single\--pass proteins, we checked if this probability was spread throughout non\--charged amino acids as well as negatively\--charged amino acids by examining non-charged polar residues for inside versus outside preference (Figure~\ref{fig:multi_pass_charge_distribution}B and Figure~\ref{fig:multi_pass_charge_distribution}C).
There was no clear difference between the inside and outside flank relative percentages for ExpAll since the inside flank was 1.9\% (relative percentage average) and the outside flank was 2.0\% (relative percentage average).
There was some small difference in the UniHuman dataset with the inside average being 1.9\% and the outside average being 2.2\%.
This, however, is a much smaller difference than the flank differences for negatively\--charged residues in the UniHuman dataset.
% Table generated by Excel2LaTeX from sheet 'Sheet1'
\begin{table}[htbp]
\centering
\captionof{table}[Statistical significances for negative charge distribution skew on either side of the membrane in multi\--pass transmembrane helices.]{\textbf{Statistical significances for negative charge distribution skew on either side of the membrane in multi\--pass transmembrane helices.}
The ``Helices'' column refers to the total~\gls{tmh}s contained in each dataset (ExpAll,~\gls{tmh}s from TOPDB~\cite{Dobson2015}; UniHuman, human representative proteome; UniER, human endoplasmic reticulum representative proteome; UniGolgi, human Golgi representative proteome; UniPM, human plasma membrane representative proteome; UniCress, Arabidopsis thaliana (mouse-ear cress) representative proteome; UniFungi, fungal representative proteome; UniBacilli, Bacilli class representative proteome; UniEcoli, Escherichia coli representative proteome; UniArch, Archaea representative proteome; see Methods for details).
In (A) the ``Database-defined flanks'' and in (B) the ``Database-defined viable* flanks'' and the ``Overlapping flanks'' columns, the ``Negative residues'' column refers to the total number of negative residues found in the $\pm$10 flanking residues on either side of the~\gls{tmh} and does not include residues found in the~\gls{tmh} itself.
(A) In the ``Flanks after central alignment'' column, the ``Negative residues'' column refers to the total number of negative residues found in the –20 to –10 residues and the +10 to +20 residues from the centrally aligned residues with a maximum database defined flank length of 20 residues.
The total number of proteins is given in the IDs column.
The ``Helices'' column contains the total number of~\gls{tmh}s in the dataset (n), the average number of~\gls{tmh}s per protein in that population ($\mu$) and the standard deviation of that average ($\sigma$).
The~\gls{kw} scores were calculated for negative residues by comparing the number of negatively\--charged residues that were within 10 residues inside and 10 residues outside the~\gls{tmh}.
*Here, ``viable'' indicates that, for each~\gls{tmh} used, the flanks on both sides of the~\gls{tmh} have a length of at least half the maximum allowed flank length; with a maximum of 10 residues, the viable length is 5.}
\resizebox{\textwidth}{!}{(A)
\begin{tabular}{ p{5em} l l l l l l l l l l l l }
\toprule
\multicolumn{5}{ p{25em} }{multi\--pass} & \multicolumn{4}{p{20em} }{Database-defined flanks} & \multicolumn{4}{p{20em} }{Flanks after central alignment} \\
\midrule
\multirow{2}[4]{*}{Data-set} & \multicolumn{1}{l }{\multirow{2}[4]{*}{IDs}} & \multicolumn{3}{p{15em} }{Helices} & \multicolumn{2}{p{10em} }{Negative residues} & \multicolumn{1}{l }{\multirow{2}[4]{*}{H statistic}} & \multicolumn{1}{l }{\multirow{2}[4]{*}{p\--value}} & \multicolumn{2}{p{10em} }{Negative residues} & \multicolumn{1}{l }{\multirow{2}[4]{*}{H statistic}} & \multicolumn{1}{l }{\multirow{2}[4]{*}{p\--value}} \\
\cmidrule{3-7}\cmidrule{10-11} \multicolumn{1}{ l }{} & & \multicolumn{1}{p{5em} }{\textit{n}} & \multicolumn{1}{p{5em} }{$\mu$} & \multicolumn{1}{p{5em} }{$\sigma$} & \multicolumn{1}{p{5em} }{Inside} & \multicolumn{1}{p{5em} }{Outside} & & & \multicolumn{1}{p{5em} }{Inside} & \multicolumn{1}{p{5em} }{Outside} & & \\
\midrule
ExpAll & 2205 & 15,563 & 7.07 & 3.95 & 9709 & 9598 & 0.04 & 8.43E-01 & 9648 & 9659 & 0.35 & 5.56E-01 \\
\midrule
UniHuman & 1789 & 12,353 & 6.93 & 3.2 & 7196 & 9164 & 147.5 & 6.10E-34 & 6740 & 8968 & 179.77 & 5.43E-41 \\
\midrule
UniER & 155 & 898 & 5.85 & 3.2 & 630 & 584 & 0.44 & 5.08E-01 & 578 & 576 & 0.03 & 8.58E-01 \\
\midrule
UniGolgi & 61 & 383 & 6.28 & 2.97 & 274 & 261 & 0.02 & 8.75E-01 & 266 & 259 & 0.09 & 7.65E-01 \\
\midrule
UniPM & 427 & 3079 & 7.22 & 3.3 & 1945 & 2499 & 47.98 & 4.30E-12 & 1791 & 2440 & 64.42 & 1.01E-15 \\
\midrule
UniCress & 507 & 3823 & 7.55 & 3.32 & 2567 & 2426 & 0.73 & 3.93E-01 & 2398 & 2433 & 1.11 & 2.93E-01 \\
\midrule
UniFungi & 1338 & 8685 & 6.5 & 3.75 & 5560 & 5266 & 5.83 & 1.57E-02 & 5140 & 5214 & 0 & 9.62E-01 \\
\midrule
UniBacilli & 140 & 822 & 5.94 & 3.98 & 470 & 468 & 0.07 & 7.92E-01 & 450 & 471 & 0.92 & 3.38E-01 \\
\midrule
UniEcoli & 529 & 3888 & 7.39 & 3.76 & 1990 & 1902 & 0.26 & 6.07E-01 & 1875 & 1887 & 0.18 & 6.71E-01 \\
\midrule
UniArch & 59 & 327 & 5.97 & 2.73 & 245 & 175 & 7.98 & 4.72E-03 & 235 & 181 & 7.08 & 7.81E-03 \\
\bottomrule
\end{tabular}
}
\\
\resizebox{\textwidth}{!}{(B)
% Table generated by Excel2LaTeX from sheet 'Sheet1'
\begin{tabular}{ p{5em} l l l l l l l l llll }
\toprule
multi\--pass & \multicolumn{4}{p{20em} }{Overlapping flanks} & \multicolumn{8}{p{40em} }{Database-defined viable* flanks} \\
\midrule
\multirow{2}[4]{*}{Data-set} & \multicolumn{2}{p{10em} }{Negative residues} & \multicolumn{1}{l }{\multirow{2}[4]{*}{H statistic}} & \multicolumn{1}{l }{\multirow{2}[4]{*}{p\--value}} & \multicolumn{1}{l }{\multirow{2}[4]{*}{\textit{N}}} & \multicolumn{2}{p{10em} }{Negative residues} & \multicolumn{1}{l }{\multirow{2}[4]{*}{H statistic}} & \multicolumn{4}{l }{\multirow{2}[4]{*}{p\--value}} \\
\cmidrule{2-3}\cmidrule{7-8} \multicolumn{1}{ l }{} & \multicolumn{1}{p{5em} }{Inside} & \multicolumn{1}{p{5em} }{Outside} & & & & \multicolumn{1}{p{5em} }{Inside} & \multicolumn{1}{p{5em} }{Outside} & & \multicolumn{4}{l }{} \\
\midrule
ExpAll & 11,969 & 12,615 & 22.54 & 2.05E-06 & 8808 & 6082 & 6916 & 59.93 & \multicolumn{4}{l }{9.81E-15} \\
\midrule
UniHuman & 8645 & 11,181 & 254.3 & 3.00E-57 & 8183 & 5169 & 6915 & 179.71 & \multicolumn{4}{l }{5.60E-41} \\
\midrule
UniER & 750 & 763 & 1.16 & 2.81E-01 & 516 & 398 & 441 & 3.16 & \multicolumn{4}{l }{7.55E-02} \\
\midrule
UniGolgi & 333 & 369 & 7.12 & 7.64E-03 & 195 & 162 & 186 & 3 & \multicolumn{4}{l }{8.30E-02} \\
\midrule
UniPM & 2319 & 3107 & 99.68 & 1.79E-23 & 1977 & 1343 & 1960 & 98.63 & \multicolumn{4}{l }{3.05E-23} \\
\midrule
UniCress & 3142 & 3298 & 9.21 & 2.41E-03 & 2110 & 1626 & 1741 & 6.4 & \multicolumn{4}{l }{1.14E-02} \\
\midrule
UniFungi & 6724 & 6814 & 0.46 & 4.96E-01 & 4581 & 3340 & 3411 & 0.41 & \multicolumn{4}{l }{5.22E-01} \\
\midrule
UniBacilli & 585 & 636 & 2.65 & 1.04E-01 & 382 & 230 & 306 & 12.73 & \multicolumn{4}{l }{3.61E-04} \\
\midrule
UniEcoli & 2574 & 2800 & 17.88 & 2.35E-05 & 1596 & 951 & 1114 & 16.57 & \multicolumn{4}{l }{4.69E-05} \\
\midrule
UniArch & 342 & 248 & 14.67 & 1.28E-04 & 132 & 120 & 104 & 0.28 & \multicolumn{4}{l }{5.97E-01} \\
\bottomrule
\end{tabular}%
}
\label{table:multipassstats}
\end{table}
Surprisingly, the result could not straightforwardly be repeated with the considerably smaller ExpAll.
Under condition (i), we find with ExpAll that aspartic acid has a background level of 1.0\%, an average of 2.6\% on the inside flank and of 2.9\% on the outside flank, whereas glutamic acid has a background level of 1.2\%, an average of 2.8\% on the inside flank and of 2.5\% on the outside flank.
Statistical tests do not support finding a negative\--charge bias in conditions (i) and (ii).
Apparently, the problem is~\gls{tmh}s having no or almost no flanks at one of the sides.
Statistical significance for the negative\--charge bias is detected as soon as this problem is dealt with, either by allowing flanks to overlap between neighbouring~\gls{tmh}s as in condition (iii) or by removing examples without proper flank lengths from the dataset as in condition (iv).
The respective p\--values are 2.05e-6 and 9.81e-15.
The issues we had with ExpAll raised the question of whether sequence redundancy in the UniHuman set could have played a role.
Therefore, we repeated all calculations but with UniRef50 instead of UniRef90 for mapping into sequence clusters (see Methods section for detail).
We were surprised to see that harsher sequence redundancy requirements do not affect the outcome of the statistical tests in any major way.
For conditions (i)--(iv), we computed the following p\--values: (i) 1.31e-28 (5940 negatively\--charged residues inside versus 7492 outside), (ii) 1.38e-36 (5516 versus 7320), (iii) 5.60e-53 (7089 versus 9233) and (iv) 4.18e-41 (4232 versus 5730).
So, the amplifying effect of some subsets in the overall dataset on the statistical test that might be caused by allowing overlapping flanks (condition (iii)) is not the major factor leading to the negative charge skew.
Similarly, the trend is also not caused by sequence redundancy.
Thus, we have learned that the negative\--charge bias does also exist in multi\--pass~\gls{tmp}s but under the conditions that there are sufficiently long loops between~\gls{tmh}s.
Bluntly put: no loops, no charge bias.
As soon as the loops reach some critical length, there are differences between single\--pass and multi\--pass~\gls{tmh}s with regard to occurrence and distribution of negative charges and the inside-suppression/outside-enrichment negative\--charge bias appears.
Not only are there more negative charges within the multi\--pass~\gls{tmh} itself (in fact, negative charges are almost not tolerated in single\--pass~\gls{tmh}s; see Table \ref{table:acidicresiduesarerare}), but also, there is a much stronger negative outside skew in the~\gls{tmh}s of single\--pass proteins than those of multi\--pass proteins.
\subsection{Further significant sequence differences between single\--pass and multi\--pass helices: distribution of tryptophan, tyrosine, proline and cysteine}
Amino acid residue profiles along the~\gls{tm} segment and its flanks differ between single- and multi\--pass~\gls{tmh}s also in other aspects.
The relative percentages of all amino acid types (normalisation by the total amount of that residue type in the sequence segment) from single\--pass helices of the UniHuman (Figure \ref{fig:comp_heatmaps}A; from 1705~\gls{tmh}s with flanks having 68571 residues) and ExpAll (Figure \ref{fig:comp_heatmaps}B; from 1544~\gls{tmh}s with flanks having 60200 residues) were plotted as a heat-map.
The amino acid types were listed on the Y axis according to Kyte \& Doolittle hydrophobicity~\cite{Kyte1982} in descending order.
\begin{figure}[p]
\centering
\includegraphics[width=1\textwidth]{NNI_chapter/comp_heatmaps}
\captionof{figure}[Relative percentage heat-maps from predictive and experimental datasets corroborate residue distribution differences between transmembrane helices from single\--pass and multi\--pass proteins.]{\textbf{Relative percentage heat-maps from predictive and experimental datasets corroborate residue distribution differences between transmembrane helices from single\--pass and multi\--pass proteins.}
The residue position aligned to the centre of the~\gls{tmh} is on the horizontal axis, and the residue type is on the vertical axis.
Amino acid types are listed in order of decreasing hydrophobicity according to the Kyte and Doolittle scale~\cite{Kyte1982}.
The flank lengths in the~\gls{tmh} segments were restricted to up to $\pm$10 residues.
The scales for each heat-map are shown beneath the respective subfigure.
The darkest blue represents 0\% distribution, whilst the darkest red represents the maximum relative percentage distribution that is denoted by the keys in each subfigure, with white being 50\% between ``cold'' and ``hot''.
The central~\gls{tmh} subplots extend from the central~\gls{tmh} residue, whereas the inner and outer flank subplots use the database-defined~\gls{tmh} boundary and extend from that position.
(a)~\gls{tmh}s from the single\--pass UniHuman dataset.
(b) Single\--pass protein~\gls{tmh}s from the ExpAll dataset.
(c)~\gls{tmh}s from the proteins of the multi\--pass UniHuman dataset.
(d)~\gls{tmh}s from ExpAll multi\--pass proteins.
The general consistency in relative distributions of every residue type between single\--pass and multi\--pass of either dataset including flank/\gls{tmh} boundary selection allows us to infer biological conclusions from these distributions that are independent of methodological biases used to gather the sequences.
The only residue that is distributed drastically differently between the datasets is cysteine, and only in multi\--pass~\gls{tmh}s.
The most striking differences in distributions between residues from~\gls{tmh}s of single\--pass and multi\--pass proteins include a more defined Y and W clustering at the flanks, a suppression of E and D on the inside flank, a suppression of P on the inside flank and a topological bias for C favouring the inside flank.}
\label{fig:comp_heatmaps}
\end{figure}
In accordance with expectations, enrichment of hydrophobic residues in the~\gls{tmh} and of positively\--charged residues on the inside flank, as well as the negative\--charge distribution bias, was found in both datasets.
Additionally, the inside interfacial region showed consistent enrichment hotspots for tryptophan (e.g., 7.1\% at position -11 in ExpAll, 6.2\% at position -10 in UniHuman with flanks after central~\gls{tmh} alignment) and tyrosine (6.4\% at -11 in ExpAll, 7.1\% at -11 in UniHuman), and some preference can also be seen for the outer interfacial region (\textit{e.g.}, 5.2\% at position 11 for tryptophan in ExpAll, and 5.8\% at position 10 for tryptophan in UniHuman) albeit the ``hot'' cluster of the outer flank covers fewer positions than that of the inner flank.
Further, there is an apparent bias of cysteine towards the inner flank and interfacial region (e.g., 5.5\% at position -10 in ExpAll, 5.9\% at position -11 in UniHuman), and a depression in the outer interfacial region and flank (down to a minimum of 0.3\% in both ExpAll and UniHuman).
Proline appears to have a depression signal on the outer flank.
Note that, in a similar way to Figures \ref{fig:single_pass_charge_distribution} and \ref{fig:multi_pass_charge_distribution}, the distributions of the flanks derived from centrally aligned~\gls{tmh}s are corroborated by the distributions from the database defined~\gls{tmh} boundary flanks (see outside bands in Figures \ref{fig:comp_heatmaps}A-D).
Similar heatmaps were generated for UniHuman multi\--pass~\gls{tmh}s (Figure \ref{fig:comp_heatmaps}C; from 12353~\gls{tmh}s with flanks having 452708 residues) and ExpAll multi\--pass~\gls{tmh}s (Figure \ref{fig:comp_heatmaps}D; from 15563~\gls{tmh}s with flanks having 535599 residues).
Whereas Figures \ref{fig:comp_heatmaps}A-C appear quite noisy, the plot for ExpAll multi\--pass~\gls{tmh}s appears almost as smooth as if a Gaussian filter had been applied, which suggests that this dataset is of high quality.
Tyrosine and tryptophan in the multi\--pass case do not appear as enriched in the interfacial regions as they are in single\--pass~\gls{tmh}s from both UniHuman and ExpAll.
Prolines are only suppressed in the~\gls{tmh} itself and are not suppressed in the outer flank as in the single\--pass case but, indeed, are tolerated if not slightly enriched in the flanks.
\subsection{Hydrophobicity and leucine distribution in transmembrane helices in single- and multi\--pass proteins}
Generally, we see in Figure \ref{fig:comp_heatmaps} that compositional biases appear more extreme in the single\--pass case, with polar residues more heavily suppressed and non-polar residues more heavily enriched.
To investigate this observation, we calculated the hydrophobicity at each sequence-position averaged over all~\gls{tmh}s considered (after having window-averaged over 3 residues for each~\gls{tmh}) using the Kyte \& Doolittle hydrophobicity scale~\cite{Kyte1982} (Figure~\ref{fig:hydrophobicity_single_multi}A) and validated using White and Wimley octanol-interface whole residue scale~\cite{White1999}, Hessa’s biological hydrophobicity scale~\cite{Hessa2005}, and the Eisenberg hydrophobic moment consensus scale~\cite{Eisenberg1984} (Figure~\ref{fig:hydrophobicity_scale_comparison}).
The total set of~\gls{tmh}s was split into 15 sets of membrane-spanning proteins (1 set containing single\--pass proteins, 13 sets each containing~\gls{tmh}s from 2-, 3-, 4-\ldots 14-\gls{tmp}s and another of~\gls{tmh}s from proteins with 15 or more~\gls{tmh}s).
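The averaging and stratification described above can be sketched as follows.
This is a minimal illustration rather than the pipeline used for the figures: the function names, the \texttt{(sequence, centre\_index)} input format and the truncation of the window at the segment ends are assumptions made here, whereas the Kyte \& Doolittle values are those of the published scale~\cite{Kyte1982}.

```python
# Kyte & Doolittle (1982) hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def windowed_scores(sequence, window=3):
    """Unweighted sliding-window mean of the hydropathy values.

    Assumption: windows are truncated at the ends of the segment.
    """
    raw = [KYTE_DOOLITTLE[aa] for aa in sequence]
    half = window // 2
    scores = []
    for i in range(len(raw)):
        chunk = raw[max(0, i - half): i + half + 1]
        scores.append(sum(chunk) / len(chunk))
    return scores


def mean_profile(segments, window=3):
    """Average the windowed scores at each position relative to the TMH centre.

    `segments` is a list of (sequence, centre_index) pairs (hypothetical
    input format); position 0 is the central residue and negative
    positions lie towards the cytoplasmic side.
    """
    totals, counts = {}, {}
    for sequence, centre in segments:
        for i, score in enumerate(windowed_scores(sequence, window)):
            pos = i - centre
            totals[pos] = totals.get(pos, 0.0) + score
            counts[pos] = counts.get(pos, 0) + 1
    return {pos: totals[pos] / counts[pos] for pos in sorted(totals)}


def tmh_count_stratum(n_tmh):
    """Map a protein's TMH count to one of the 15 strata: '1' ... '14', '15+'."""
    if n_tmh < 1:
        raise ValueError("a membrane-spanning protein has at least one TMH")
    return str(n_tmh) if n_tmh < 15 else "15+"


# Toy input: two five-residue segments, each centred on index 2.
profile = mean_profile([("KLLLV", 2), ("RLLLD", 2)])
```

Grouping the segments by \texttt{tmh\_count\_stratum} before calling \texttt{mean\_profile} yields one averaged profile per stratum.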
In Figure~\ref{fig:hydrophobicity_single_multi}B, we show the p\--value at each sequence position obtained by comparing the respective values from multi\--pass and single\--pass~\gls{tmh}s using the 2-sample t-test.
Strikingly, the inside flank of the single\--pass~\gls{tmh}s is much more hydrophilic (e.g., see the Kyte \& Doolittle score=-1.3 at position -18) than that of multi\--pass~\gls{tmh}s (p\--value=5.64e-103 at position -14).
Most likely, the positive-inside rule, along with the interfacial clustering of tryptophan and tyrosine, contributes to a strongly polar inside flank in single\--pass helices that is not present in multi\--pass helices en masse.
Further, multi\--pass~\gls{tmh}s cluster remarkably closely within the~\gls{tm} core; the respective hydrophobicity is apparently not dependent on the number of~\gls{tmh}s in a given multi\--pass~\gls{tmp}.
On average, single\--pass~\gls{tmh}s are more hydrophobic in the core than multi\--pass~\gls{tmh}s (p\--value$<$1.e-72 within positions $-5$ to $5$ and p\--value=5.92e-190 at position 0).
On the other hand, hydrophobicity differences between~\gls{tmh}s from single- and multi\--pass proteins fade somewhat at the transition towards the flanks (p\--value=1.85e-4 at position -10, and p\--value=3.35e-31 at position 10).
\begin{figure}[!ht]
\centering
\includegraphics[width=1\textwidth]{NNI_chapter/hydrophobicity_single_multi}
\captionof{figure}[There is a difference in the hydrophobic profiles of transmembrane helices from single\--pass and multi\--pass proteins.]{\textbf{There is a difference in the hydrophobic profiles of transmembrane helices from single\--pass and multi\--pass proteins.}
a The hydrophobicity of single\--pass~\gls{tmh}s compared to multi\--pass segments from the UniHuman dataset.
The Kyte and Doolittle scale of hydrophobicity~\cite{Kyte1982} was used with a window length of 3 to compare~\gls{tmh}s from proteins with different numbers of~\gls{tmh}s.
This scale is based on the water-vapour transfer of free energy and the interior-exterior distribution of individual amino acids.
The same datasets also had different scales applied (Figure~\ref{fig:hydrophobicity_scale_comparison}).
The vertical axis is the hydrophobicity score, whilst the horizontal axis is the position of the residue relative to the centre of the~\gls{tmh}, with negative values extending into the cytoplasm.
In black are the average hydrophobicity values of~\gls{tmh}s belonging to single\--pass proteins, whilst in other colours are the average hydrophobicity values of~\gls{tmh}s belonging to multi\--pass proteins containing the same numbers of~\gls{tmh}s per protein.
In purple are the~\gls{tmh}s from proteins with more than 15~\gls{tmh}s per protein that do not share a typical multi\--pass profile, perhaps due to their exceptional nature.
b The Kruskal-Wallis test (H statistic) was used to compare single\--pass windowed hydrophobicity values with the average windowed hydrophobicity value of every~\gls{tmh} from multi\--pass proteins at the same position.
The vertical axis is the logarithmic scale of the resultant p\--values.
We can much more readily reject the hypothesis that hydrophobicity is the same between~\gls{tmh}s from single\--pass and multi\--pass proteins in the core of the helix and the flanks than the interfacial regions, particularly at the inner leaflet due to leucine asymmetry (Table~\ref{table:leucineskewstats})}
\label{fig:hydrophobicity_single_multi}
\end{figure}
\begin{figure}[!ht]
\centering
\includegraphics[width=1\textwidth]{NNI_chapter/hydrophobicity_scale_comparison}
\captionof{figure}[There is a difference in the hydrophobic profiles of transmembrane helices from single\--pass and multi\--pass proteins.]{\textbf{There is a difference in the hydrophobic profiles of transmembrane helices from single\--pass and multi\--pass proteins.}
The difference in hydrophobicity between the single\--pass and multi\--pass datasets stratified by number of~\gls{tmh}s is not due to the choice of scale.
As with Figure~\ref{fig:hydrophobicity_single_multi}, UniHuman was stratified according to the number of~\gls{tmh}s in each protein.
The mean amino acid hydrophobicity values of~\gls{tmh}s with a sliding unweighted window of 3 residues from UniHuman proteins at each position were plotted.
To validate the findings presented in Figure \ref{fig:hydrophobicity_single_multi}A, several scales of hydrophobicity were used.
(A) The White and Wimley whole residue scale~\cite{White1999} is based on the partitioning of peptides between water and octanol as well as water to~\gls{popc}.
A positive score indicates a more polar residue.
(B) The Hessa biological scale~\cite{Hessa2005}.
The hydrophobicity values represent the free energy exchange during recognition of designed peptide~\gls{tmh}s by the endoplasmic reticulum Sec61 translocon and, therefore, negative values indicate an energetic preference for the interior of a lipid bilayer.
(C) The Eisenberg consensus scale~\cite{Eisenberg1984} is a scale based on the earlier scales from Nozaki and Tanford~\cite{Nozaki1971}, Wolfenden \textit{et al.}~\cite{Wolfenden1981}, Chothia~\cite{Chothia1976}, Janin~\cite{Janin1979} and the von Heijne and Blomberg scale~\cite{VonHeijne1979}.
The scales are normalised according to serine.
A positive score indicates a generally more hydrophobic residue.}
\label{fig:hydrophobicity_scale_comparison}
\end{figure}
Leucine is the most abundant residue in~\gls{tmh}s (Figure~\ref{fig:amino_acid_distribution}) and is considered one of the most hydrophobic residues by all hydrophobicity scales.
Therefore, it plays a very influential role in~\gls{tmh} helix-helix and lipid-helix interactions in the membrane and recognition by the insertion machinery.
When looking at the difference in the abundance of leucine between the inner and outer halves, we find that~\gls{tmh}s tend to contain more leucine residues at the cytoplasmic side, particularly in the case of~\gls{tmh}s from single\--pass proteins (see Figures~\ref{fig:single_pass_charge_distribution} and~\ref{fig:comp_heatmaps}).
This trend is statistically significant for~\gls{tmh}s in many biological membranes (Table~\ref{table:leucineskewstats}, Figure~\ref{fig:dataset_distributions}).
In the most extreme case of UniCress (single\--pass), we see 49\% more leucine residues on the inside leaflet than the outside leaflet (p\--value=5.41e-24).
This contrasts with UniCress (multi\--pass), in which the skew is far weaker, albeit still statistically significant: there are 6\% more leucine residues at the inside half (p\--value=2.08e-4).
The trend of having more leucine residues at the cytoplasmic half of the~\gls{tmh} is observed for all datasets (both single- and multi\--pass) except for UniArch (single\--pass).
The phenomenon is statistically significant with p\--value$<$1.e-3 for ExpAll, UniHuman, UniPM and UniCress (both single- and multi\--pass).
As with negative charge distribution, UniArch presents a reversed effect compared to other single\--pass protein datasets with a 57\% reduction in leucine on the inside leaflet compared to the outside leaflet (p\--value=7.25e-6).
However, leucine residues of~\gls{tmh}s from UniArch multi\--pass proteins have no discernible preference for the inside leaflet (4\% more on the inside leaflet, p\--value=0.625).
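The quantities reported in Table~\ref{table:leucineskewstats} can be illustrated with the short sketch below.
It is not the code used to generate the table: the function names are hypothetical, and the conversion of the H statistic into a p\--value assumes the usual $\chi^2$ approximation with one degree of freedom (two groups), which closely reproduces the tabulated p\--values for the ExpAll and UniArch single\--pass rows.

```python
import math


def inside_outside_percentage(inside, outside):
    """Inside leucine count expressed as a percentage of the outside count."""
    return 100.0 * inside / outside


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the mean of the ranks i+1 .. j.
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        ties = j - i
        tie_term += ties ** 3 - ties
        i = j
    correction = 1.0 - tie_term / float(n ** 3 - n)
    if correction == 0.0:
        raise ValueError("all observations are identical")
    rank_sum_term = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    h = 12.0 / (n * (n + 1)) * rank_sum_term - 3.0 * (n + 1)
    return h / correction


def p_value_two_groups(h):
    """Chi-square survival function with 1 degree of freedom (assumption)."""
    return math.erfc(math.sqrt(h / 2.0))


# Counts and H statistics taken from the single-pass columns of the table.
expall_percentage = inside_outside_percentage(4020, 3403)
uniarch_percentage = inside_outside_percentage(51, 118)
expall_p = p_value_two_groups(40.07)
uniarch_p = p_value_two_groups(20.13)

# Toy per-helix leucine counts (hypothetical) for the inner and outer halves.
toy_h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6]])
```

In the real analysis, the two groups passed to \texttt{kruskal\_wallis\_h} would be the per-helix leucine counts of the inner and outer halves of the database-defined~\gls{tmh}s.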
% Table generated by Excel2LaTeX from sheet 'Sheet1'
\begin{table}[htbp]
\centering
\captionof{table}[Leucines at the inner and outer leaflets of the membrane in transmembrane helices.]{\textbf{Leucines at the inner and outer leaflets of the membrane in transmembrane helices.}
The statistical results when comparing the number of leucine residues from the inner and outer leaflets in each protein in the dataset.
The number of helices per dataset can be found in Table~\ref{table:acidicresiduesarerare}.
The Kruskal-Wallis test scores (H statistics) were calculated for leucine residues by comparing the number of leucine residues that were in the inner half of the leaflet with those in the outer half of the leaflet of the database-defined TMH}
\resizebox{\textwidth}{!}{
\begin{tabular}{ p{5em} l l r r r l l r r r }
\toprule
\multirow{2}[4]{*}{\textbf{Dataset}} & \multicolumn{5}{p{25em} }{\textbf{single\--pass}} & \multicolumn{5}{p{25em} }{\textbf{multi\--pass}} \\
\cmidrule{2-11} \multicolumn{1}{ l }{} & \multicolumn{1}{p{5em} }{\textbf{Inside}} & \multicolumn{1}{p{5em} }{\textbf{Outside}} & \multicolumn{1}{p{5em} }{\textbf{Percentage}} & \multicolumn{1}{p{5em} }{\textbf{H statistic}} & \multicolumn{1}{p{5em} }{\textbf{p\--value}} & \multicolumn{1}{p{5em} }{\textbf{Inside}} & \multicolumn{1}{p{5em} }{\textbf{Outside}} & \multicolumn{1}{p{5em} }{\textbf{Percentage}} & \multicolumn{1}{p{5em} }{\textbf{H statistic}} & \multicolumn{1}{p{5em} }{\textbf{p\--value}} \\
\midrule
ExpAll & 4020 & 3403 & 118.13 & 40.07 & 2.44E-10 & 27,986 & 27,008 & 103.62 & 14.13 & 1.70E-04 \\
\midrule
UniHuman & 4982 & 3697 & 134.76 & 193.02 & 6.99E-44 & 25,199 & 22,365 & 112.67 & 195.24 & 2.29E-44 \\
\midrule
UniER & 359 & 297 & 120.88 & 8.41 & 3.72E-03 & 1863 & 1764 & 105.61 & 3.98 & 4.61E-02 \\
\midrule
UniGolgi & 604 & 513 & 117.74 & 10.74 & 1.05E-03 & 753 & 677 & 111.23 & 5.61 & 1.79E-02 \\
\midrule
UniPM & 1485 & 1006 & 147.61 & 98.9 & 2.65E-23 & 6221 & 5577 & 111.55 & 35.21 & 3.00E-09 \\
\midrule
UniCress & 1495 & 1005 & 148.76 & 102.05 & 5.41E-24 & 6491 & 6099 & 106.43 & 13.76 & 2.08E-04 \\
\midrule
UniFungi & 1389 & 1308 & 106.19 & 3.41 & 6.48E-02 & 14,505 & 14,099 & 102.88 & 6.74 & 9.41E-03 \\
\midrule
UniBacilli & 260 & 251 & 103.59 & 0.03 & 8.72E-01 & 1488 & 1335 & 111.46 & 7.59 & 5.89E-03 \\
\midrule
UniEcoli & 130 & 100 & 130 & 2.78 & 9.53E-02 & 7251 & 6975 & 103.96 & 5.92 & 1.50E-02 \\
\midrule
UniArch & 51 & 118 & 43.22 & 20.13 & 7.25E-06 & 636 & 612 & 103.92 & 0.24 & 6.25E-01 \\
\bottomrule
\end{tabular}
}%
\label{table:leucineskewstats}
\end{table}%
\begin{figure}[!ht]
\centering
\includegraphics[width=1\textwidth]{NNI_chapter/dataset_distributions}
\captionof{figure}[Comparing charged amino acid distributions in transmembrane helices of multi\--pass and single\--pass proteins across different species and organelles.]{\textbf{Comparing charged amino acid distributions in transmembrane helices of multi\--pass and single\--pass proteins across different species and organelles.} The relative percentage distribution of charged residues and leucine was calculated at each position in the~\gls{tmh} with flank lengths of $\pm$20 in different datasets.
The distributions are normalised according to relative percentage distribution.
Aspartic acid and glutamic acid are shown in dark purple and light purple respectively.
Leucine, the most abundant non-polar residue in~\gls{tmh}s, is in blue.
Arginine and lysine are shown in orange.
TMHs from single\--pass proteins are on the left and~\gls{tmh}s from multi\--pass proteins are on the right for different taxonomic datasets: a UniCress, b UniFungi, c UniEcoli, d UniBacilli, e UniArch, and different organelles: f UniER, g UniGolgi, h UniPM.
As a trend, the negative-outside skew is more present in~\gls{tmh}s from single\--pass proteins than multi\--pass proteins (Tables 2 and 3).
Another key observation is that in single\--pass~\gls{tmh}s there is a propensity for leucine on the inner over the outer leaflet (Table \ref{table:leucineskewstats})}
\label{fig:dataset_distributions}
\end{figure}
\subsection{A negative-outside (or negative-non-inside) signal is present across many membrane types}
We explored whether the amino acid compositional skews described above for human~\gls{tmp}s are also present in other taxa and, specifically for human proteins, in membranes at various subcellular localisations.
Acidic residues in~\gls{tmh}s from single\--pass and multi\--pass proteins were plotted according to their relative percentage distributions (of the total amount of this residue type in the respective segment) for five taxon-specific datasets UniCress (Figure~\ref{fig:dataset_distributions}A), UniFungi (Figure~\ref{fig:dataset_distributions}B), UniEcoli (Figure~\ref{fig:dataset_distributions}C), UniBacilli (Figure~\ref{fig:dataset_distributions}D), UniArch (Figure~\ref{fig:dataset_distributions}E) and for three organelle-specific datasets UniER (Figure~\ref{fig:dataset_distributions}F), UniGolgi (Figure~\ref{fig:dataset_distributions}G), UniPM (Figure~\ref{fig:dataset_distributions}H).
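The relative percentage normalisation can be sketched as below.
This is a minimal illustration under stated assumptions: the function name and the \texttt{(sequence, centre\_index)} input format are hypothetical, and positions are counted relative to the central residue with negative values on the cytoplasmic side.

```python
def relative_percentage_distribution(segments, residue):
    """Share of one residue type found at each aligned position, in percent.

    `segments` is a list of (sequence, centre_index) pairs (hypothetical
    input format). The counts at each position are divided by the total
    number of occurrences of `residue` in all segments, so the returned
    values sum to 100 whenever the residue occurs at least once.
    """
    counts = {}
    for sequence, centre in segments:
        for i, amino_acid in enumerate(sequence):
            pos = i - centre
            counts.setdefault(pos, 0)  # keep positions with zero occurrences
            if amino_acid == residue:
                counts[pos] += 1
    total = sum(counts.values())
    if total == 0:
        return {pos: 0.0 for pos in sorted(counts)}
    return {pos: 100.0 * counts[pos] / total for pos in sorted(counts)}


# Toy input: two five-residue segments, each centred on index 2.
aspartate = relative_percentage_distribution(
    [("DLLLK", 2), ("ELLDK", 2)], "D"
)
```

Because each residue type is normalised by its own total, curves for rare and abundant residues can be compared on the same axis.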
For single\--pass proteins in all taxon-specific datasets (with the exception of UniArch), there are more negative residues at the outside than at the inside.
The skew is statistically significant (see Table~\ref{table:negativeskewsinglepass}, P$<$0.001) except for UniBacilli.
Despite the statistical significance found for UniFungi (p\--value=1.12e-7 for database-defined and p\--value=6.79e-10 for flanks after central alignment; Table~\ref{table:negativeskewsinglepass}), the trend is not very strong in this case (Figure~\ref{fig:dataset_distributions}B).
Whereas the skew is just a suppression of negatively\--charged residues at the inside flank for ExpAll and UniHuman (as well as in UniCress), the bias observed for UniEcoli involves also a negative charge enrichment at the outside flank.
In the case of UniArch (Figure~\ref{fig:dataset_distributions}E), we see a negative inside preference that is 6.0\% in the case of aspartic acid, and 6.3\% for glutamic acid (not shown), with much lower values close to 0\% on the outside.
Whilst the difference is statistically significant for~\gls{tmh}s (Table~\ref{table:negativeskewsinglepass}) from both single\--pass proteins (p\--value=1.83e-12 and p\--value=1.43e-11 for two versions of flank determination) and multi\--pass proteins (p\--values 4.72e-3, 7.81e-3, 1.28e-4 for three versions of flank determination, see Tables 3A and 3B), the distribution along the position axis fluctuates heavily, possibly as a result of the small size of the dataset.
However, one can assuredly assign a ``negative-inside'' tendency to the flanking regions of Archaean~\gls{tmh}s.
In the human organelle datasets, we see trend shifts at different stages in the secretory pathway.
In UniER, there is an enrichment of negative charge on the outside flank of 1--1.5\% that is comparable to the magnitude of the positive inside signal.
In UniGolgi, there is a suppression of negatively\--charged residues on the inside flank as well as an enrichment on the outside flank, resulting in a distribution difference of \(\sim\)2\%.
For UniPM, there is a negative-inside suppression (but no outside enrichment) as well as a positive-inside signal.
All observed trends are statistically significant (see Table~\ref{table:negativeskewsinglepass}, P$<$1.e-5).
For~\gls{tmh}s from multi\--pass proteins, inspection of the graphs in Figure~\ref{fig:dataset_distributions} shows either the same trends in a weaker form or no skew at all.
For datasets UniER, UniGolgi, UniCress, UniFungi, and UniBacilli, the hypothesis of equal distribution of negatively\--charged residues cannot be rejected (p\--value$>$0.001, see Table \ref{table:multipassstats}); thus, a skew is statistically non-significant.
Although UniPM has a statistically significant bias (p\--value$<$4.30e-12, Table \ref{table:multipassstats}), the trends are more subtle and most present for aspartic acid of UniPM\@.
We see many more negative and positive charges tolerated within the multi\--pass~\gls{tmh}s themselves throughout all datasets (Table \ref{table:acidicresiduesarerare}).
Notably, the positive-inside rule holds for all multi\--pass datasets studied herein.
To conclude, we find that a biased distribution of negative charge is a feature of single\--pass protein~\gls{tmh}s that is present across many membrane types, and it can take the form of a suppression of negative charge at the inside flank or an enrichment of those charges at the outside flank.
\subsection{Amino acid compositional skews in relation to transmembrane helix complexity and anchorage function}
\begin{figure}[p]
\centering
\includegraphics[width=0.6\textheight]{NNI_chapter/complexity_datasets}
\captionof{figure}[Comparing the amino acid relative percentage distributions of simple and complex transmembrane helices from single\--pass proteins and transmembrane helices from multi\--pass proteins.]{\textbf{Comparing the amino acid relative percentage distributions of simple and complex transmembrane helices from single\--pass proteins and transmembrane helices from multi\--pass proteins.}
TMSOC was used to calculate which single\--pass~\gls{tmh}s were complex and which were simple from ExpAll and UniHuman datasets.
Simple~\gls{tmh}s are typically anchors without necessarily having other functions (Wong \textit{et al.}~\cite{Wong2010}).
The relative percentages from single\--pass simple (shown in light blue), single\--pass complex (red), and multi\--pass protein~\gls{tmh}s (black) were plotted for (a, c, e, g, i and k) UniHuman and (b, d, f, h, j and l) ExpAll for (a and b) positive residues, (c and d) negative residues, (e and f) tyrosine, (g and h) tryptophan, (i and j) leucine, (k and l) cysteine, and (m and n) uncharged polar amino acids: serine, asparagine, glutamine, threonine, tyrosine, and cysteine.
The slopes are statistically compared in Tables \ref{table:unihumanbahadur} and \ref{table:expallbahadur}, and as a trend, the profiles of complex~\gls{tmh}s are more similar to multi\--pass~\gls{tmh} profiles than simple~\gls{tmh}s are to multi\--pass~\gls{tmh}s}
\label{fig:complexity_datasets}
\end{figure}
In previous work, we studied the relationship of~\gls{tmh} composition, sequence complexity and function~\cite{Wong2010, Wong2011, Wong2012} and concluded that simple~\gls{tmh}s are more probably responsible for simple membrane anchorage, whereas complex~\gls{tmh}s have a biological function beyond just anchorage.
We wished to see how the skews observed in this work relate to that classification.
Therefore, the single\--pass~\gls{tmh}s from UniHuman and ExpAll were separated into subsets of simple, twilight, and complex~\gls{tmh}s using TMSOC~\cite{Wong2011, Wong2012}.
The relative percentages of eight residue types (L, D, E, R, K, Y, W, C\@; normalisation with the total amount of residues of that amino acid type in all sequence segments considered) were plotted along the sequence position for simple and complex helices (Figure~\ref{fig:complexity_datasets}).
Of UniHuman single\--pass proteins, there were 889 records with simple~\gls{tmh}s and 570 with complex~\gls{tmh}s (Figure~\ref{fig:complexity_datasets}B).
In ExpAll, 769~\gls{tmh}s from single\--pass proteins were simple~\gls{tmh}s and 570 were complex~\gls{tmh}s.
It is visually apparent (Figure~\ref{fig:complexity_datasets}) that there are (i) stronger skews and more inside-outside disparities in simple single\--pass~\gls{tm}s than in complex single\--pass~\gls{tm}s and (ii) greater similarities between complex single\--pass TM regions and those from multi\--pass proteins than between simple single\--pass~\gls{tm}s and either of the other two distributions.
To examine the statistical significance of these observations, we compared the amino acid distributions (K, R, K+R, D, E, D+E, Y, W, L, C) across the range of~\gls{tmh}s with flank lengths $\pm$10 residues using the~\gls{ks},~\gls{kw} and the \({\chi}^{2}\) statistical tests.
Note that the~\gls{ks} test probes for significant maximal absolute differences between distribution curves, the~\gls{kw} test detects skews between distributions, and the \({\chi}^{2}\) statistical test checks the average difference between distributions.
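For reference, the textbook forms of two of these statistics are given below.
The symbols $F_1$, $F_2$, $O_i$ and $E_i$ are introduced here for illustration only, and the exact way in which the tests were applied to the positional distributions is not restated here.
\[
D = \sup_{x} \left| F_1(x) - F_2(x) \right|,
\qquad
\chi^{2} = \sum_{i} \frac{{(O_i - E_i)}^{2}}{E_i},
\]
where $F_1$ and $F_2$ are the empirical cumulative distribution functions of the two samples being compared, and $O_i$ and $E_i$ are the observed and expected counts in category $i$.
The~\gls{ks} statistic $D$ therefore responds to the single largest gap between the two cumulative curves, whereas the \({\chi}^{2}\) statistic accumulates the deviations over all categories.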
Calculations were carried out over single\--pass complex, single\--pass simple and multi\--pass~\gls{tmh} datasets from both ExpAll and UniHuman (for p\--values and Bahadur slopes, Table~\ref{table:unihumanbahadur} (dataset UniHuman) and Table~\ref{table:expallbahadur} (dataset ExpAll)).
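As a hedged illustration of how a Bahadur slope relates to a p\--value, the sketch below assumes the commonly used definition in which the absolute value of the natural logarithm of the p\--value is divided by the sample size.
The function name, the argument names and the choice of sample size are assumptions made for this example; they are not taken from the tables.

```python
import math


def bahadur_slope(p_value, sample_size):
    """Bahadur slope under the assumed definition |ln(p)| / N.

    Larger slopes indicate a greater distance from the null hypothesis
    per observation, which makes results from datasets of different
    sizes easier to compare than raw p-values.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in the interval (0, 1]")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return abs(math.log(p_value)) / sample_size


# The same p-value yields a smaller slope when it comes from a larger sample.
small_sample_slope = bahadur_slope(1e-6, 500)
large_sample_slope = bahadur_slope(1e-6, 5000)
```

Under this definition, a p\--value of 1 gives a slope of zero regardless of the sample size.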
There is also a visual difference between simple single\--pass proteins, complex single\--pass proteins, and multipass proteins with regard to uncharged polar amino acids (serine, asparagine, glutamine, threonine, tyrosine, and cysteine), with complex single\--pass \gls{tmh}s being between multipass \gls{tmh}s and simple single\--pass \gls{tmh}s in terms of relative percentage profile across the membrane (Figure~\ref{fig:complexity_datasets}M and Figure~\ref{fig:complexity_datasets}N).
TMSOC uses hydrophobicity as part of the scrutinisation between simple and complex \gls{tmh}s, so it is not surprising that there are differences in polar residues between simple and complex \gls{tmh}s.
However, it is interesting to note that the reduction in polar residues is not spread evenly through the \gls{tmh} and flanks of simple and complex \gls{tmh}s; simple \gls{tmh}s have fewer uncharged polar residues in the core of the \gls{tmh} than complex \gls{tmh}s, relative to the flanking areas.
Because there were no observable inside\--outside flank skews in the distributions, no further statistical analysis was carried out on this set.
\begin{table}[htbp]
\centering
\captionof{table}[Simple transmembrane helices are less similar than complex transmembrane helices to transmembrane helices from multi\--pass proteins in UniHuman.]{\textbf{Simple transmembrane helices are less similar than complex transmembrane helices to transmembrane helices from multi\--pass proteins in UniHuman.}
The statistical results were gathered by comparing complex single\--pass TMHs, simple TMHs from single\--pass proteins and TMHs from multi\--pass proteins in UniHuman.
The abundance of different residues at each position when using the centrally aligned TMH approach was compared with several statistical tests (the~\gls{ks},~\gls{kw} and the $\chi^2$ statistical tests) and the Bahadur slope values of those results}
\resizebox{\textwidth}{!}{
\tiny
\begin{tabular}{ p{5em} l l l l l l }
\toprule
\multirow{2}[4]{*}{Residues} & \multicolumn{3}{p{15em} }{p\--values for $\chi^2$} & \multicolumn{3}{p{15em} }{Bahadur slopes for $\chi^2$} \\
\cmidrule{2-7} \multicolumn{1}{ l }{} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} \\
\midrule
R & 3.20E-06 & 7.38E-02 & 1.24E-01 & 6.61E-03 & 2.20E-03 & 1.27E-04 \\
\midrule
K & 2.23E-03 & 4.99E-02 & 2.14E-01 & 3.99E-03 & 3.70E-03 & 1.18E-04 \\
\midrule
D & 1.67E-09 & 3.06E-01 & 3.02E-01 & 3.34E-02 & 3.24E-03 & 1.20E-04 \\
\midrule
E & 3.80E-07 & 2.34E-01 & 2.31E-01 & 1.81E-02 & 3.05E-03 & 1.36E-04 \\
\midrule
Y & 3.86E-01 & 3.97E-01 & 2.11E-01 & 1.06E-03 & 1.47E-03 & 8.25E-05 \\
\midrule
W & 3.77E-03 & 2.97E-01 & 3.84E-01 & 8.52E-03 & 2.73E-03 & 1.13E-04 \\
\midrule
L & 3.59E-01 & 2.88E-01 & 3.21E-01 & 1.52E-04 & 3.92E-04 & 1.69E-05 \\
\midrule
C & 6.44E-01 & 3.97E-01 & 3.41E-01 & 4.29E-04 & 1.29E-03 & 8.57E-05 \\
\midrule
R+K & 2.19E-02 & 2.83E-01 & 2.52E-01 & 1.11E-03 & 6.33E-04 & 4.68E-05 \\
\midrule
D+E & 1.47E-03 & 2.86E-01 & 2.79E-01 & 4.59E-03 & 1.49E-03 & 6.15E-05 \\
\midrule
\multicolumn{1}{ l }{} & \multicolumn{3}{p{15em} }{p\--values for Kolmogorov-Smirnov} & \multicolumn{3}{p{15em} }{Bahadur slopes for Kolmogorov-Smirnov} \\
\midrule
\multicolumn{1}{ l }{} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} \\
\midrule
R & 2.31E-01 & 3.57E-04 & 1.08E-02 & 7.66E-04 & 6.71E-03 & 2.76E-04 \\
\midrule
K & 4.31E-02 & 2.18E-03 & 8.93E-01 & 2.06E-03 & 7.56E-03 & 8.68E-06 \\
\midrule
D & 1.39E-01 & 5.02E-06 & 1.08E-02 & 3.26E-03 & 3.34E-02 & 4.52E-04 \\
\midrule
E & 7.96E-02 & 1.58E-05 & 1.08E-02 & 3.10E-03 & 2.32E-02 & 4.20E-04 \\
\midrule
Y & 7.96E-02 & 2.22E-02 & 2.31E-01 & 2.81E-03 & 6.07E-03 & 7.78E-05 \\
\midrule
W & 2.31E-01 & 9.06E-04 & 4.31E-02 & 2.24E-03 & 1.58E-02 & 3.70E-04 \\
\midrule
L & 2.31E-01 & 2.31E-01 & 5.31E-01 & 2.17E-04 & 4.61E-04 & 9.42E-06 \\
\midrule
C & 1.39E-01 & 3.61E-01 & 3.61E-01 & 1.93E-03 & 1.42E-03 & 8.10E-05 \\
\midrule
R+K & 7.96E-02 & 1.33E-04 & 7.96E-02 & 7.35E-04 & 4.48E-03 & 8.60E-05 \\
\midrule
D+E & 4.31E-02 & 1.58E-05 & 4.98E-03 & 2.21E-03 & 1.31E-02 & 2.55E-04 \\
\midrule
\multicolumn{1}{ l }{} & \multicolumn{3}{p{15em} }{p\--values for Kruskal-Wallis} & \multicolumn{3}{p{15em} }{Bahadur slopes for Kruskal-Wallis} \\
\midrule
\multicolumn{1}{ l }{} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} \\
\midrule
R & 2.19E-01 & 5.06E-02 & 2.37E-01 & 7.92E-04 & 2.52E-03 & 8.79E-05 \\
\midrule
K & 2.90E-01 & 1.33E-01 & 7.00E-01 & 8.11E-04 & 2.49E-03 & 2.73E-05 \\
\midrule
D & 3.50E-01 & 1.81E-02 & 2.81E-01 & 1.74E-03 & 1.10E-02 & 1.27E-04 \\
\midrule
E & 2.59E-01 & 5.65E-02 & 1.78E-01 & 1.65E-03 & 6.04E-03 & 1.60E-04 \\
\midrule
Y & 6.03E-01 & 4.53E-01 & 4.41E-01 & 5.62E-04 & 1.26E-03 & 4.34E-05 \\
\midrule
W & 4.19E-01 & 1.84E-01 & 5.70E-01 & 1.33E-03 & 3.81E-03 & 6.62E-05 \\
\midrule
L & 6.37E-01 & 4.88E-01 & 9.77E-01 & 6.68E-05 & 2.25E-04 & 3.47E-07 \\
\midrule
C & 5.00E-01 & 2.22E-01 & 9.62E-01 & 6.76E-04 & 2.10E-03 & 3.11E-06 \\
\midrule
R+K & 1.87E-01 & 8.67E-02 & 4.08E-01 & 4.86E-04 & 1.23E-03 & 3.05E-05 \\
\midrule
D+E & 1.68E-01 & 4.52E-02 & 1.91E-01 & 1.25E-03 & 3.68E-03 & 7.97E-05 \\
\bottomrule
\end{tabular}%
}%
\label{table:unihumanbahadur}
\end{table}%
\begin{table}[htbp]
\centering
\captionof{table}[Simple transmembrane helices are less similar than complex transmembrane helices to transmembrane helices from multi\--pass proteins in ExpAll.]{\textbf{Simple transmembrane helices are less similar than complex transmembrane helices to transmembrane helices from multi\--pass proteins in ExpAll.}
As in Table~\ref{table:unihumanbahadur}, the statistical results were gathered by comparing complex single\--pass TMHs, simple TMHs from single\--pass proteins and TMHs from multi\--pass proteins; however, in this case only ExpAll is used.
The abundance of different residues at each position when using the centrally aligned TMH approach was compared with several statistical tests (the~\gls{ks},~\gls{kw} and the $\chi^2$ statistical tests) and the Bahadur slope values of those results}
\resizebox{\textwidth}{!}{
\tiny
\begin{tabular}{ p{5em} l l l l l l }
\toprule
\multirow{2}[4]{*}{Residues} & \multicolumn{3}{p{15em} }{p\--values for $\chi^2$} & \multicolumn{3}{p{15em} }{Bahadur slopes for $\chi^2$} \\
\cmidrule{2-7} \multicolumn{1}{ l }{} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} \\
\midrule
R & 5.10E-06 & 2.98E-01 & 5.10E-06 & 9.17E-03 & 1.61E-03 & 6.23E-05 \\
\midrule
K & 2.35E-03 & 1.85E-01 & 2.35E-03 & 4.81E-03 & 3.88E-03 & 9.78E-05 \\
\midrule
D & 2.61E-08 & 1.84E-01 & 2.61E-08 & 4.15E-02 & 7.90E-03 & 1.41E-04 \\
\midrule
E & 2.38E-10 & 2.04E-01 & 2.38E-10 & 3.88E-02 & 7.08E-03 & 1.22E-04 \\
\midrule
Y & 3.03E-01 & 3.11E-01 & 3.03E-01 & 2.01E-03 & 2.49E-03 & 5.51E-05 \\
\midrule
W & 4.21E-03 & 4.29E-01 & 4.21E-03 & 1.11E-02 & 4.76E-03 & 6.46E-05 \\
\midrule
L & 3.79E-01 & 3.04E-01 & 3.79E-01 & 2.28E-04 & 4.66E-04 & 1.50E-05 \\
\midrule
C & 3.87E-01 & 2.52E-01 & 3.87E-01 & 1.75E-03 & 3.28E-03 & 1.48E-04 \\
\midrule
R+K & 7.16E-04 & 2.52E-01 & 7.16E-04 & 2.80E-03 & 1.28E-03 & 3.76E-05 \\
\midrule
D+E & 3.58E-05 & 2.94E-01 & 3.58E-05 & 1.03E-02 & 1.94E-03 & 4.90E-05 \\
\midrule
\multicolumn{1}{ l }{} & \multicolumn{3}{p{15em} }{p\--values for Kolmogorov-Smirnov} & \multicolumn{3}{p{15em} }{Bahadur slopes for Kolmogorov-Smirnov} \\
\midrule
\multicolumn{1}{ l }{} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} \\
\midrule
R & 3.61E-01 & 4.31E-02 & 3.61E-01 & 7.66E-04 & 7.79E-03 & 1.62E-04 \\
\midrule
K & 4.31E-02 & 8.93E-01 & 4.31E-02 & 2.49E-03 & 1.05E-02 & 6.57E-06 \\
\midrule
D & 1.39E-01 & 2.18E-03 & 1.39E-01 & 4.68E-03 & 3.61E-02 & 5.10E-04 \\
\midrule
E & 5.31E-01 & 1.33E-04 & 5.31E-01 & 1.11E-03 & 2.81E-02 & 6.87E-04 \\
\midrule
Y & 2.31E-01 & 9.06E-04 & 2.31E-01 & 2.47E-03 & 6.26E-03 & 3.30E-04 \\
\midrule
W & 5.31E-01 & 4.98E-03 & 5.31E-01 & 1.29E-03 & 1.13E-02 & 4.04E-04 \\
\midrule
L & 2.31E-01 & 2.31E-01 & 2.31E-01 & 3.45E-04 & 2.12E-03 & 1.85E-05 \\
\midrule
C & 5.31E-01 & 3.61E-01 & 5.31E-01 & 1.16E-03 & 8.91E-04 & 1.09E-04 \\
\midrule
R+K & 1.39E-01 & 2.31E-01 & 1.39E-01 & 7.61E-04 & 4.82E-03 & 4.00E-05 \\
\midrule
D+E & 1.39E-01 & 9.06E-04 & 1.39E-01 & 1.99E-03 & 1.41E-02 & 2.80E-04 \\
\midrule
\multicolumn{1}{ l }{} & \multicolumn{3}{p{15em} }{p\--values for Kruskal-Wallis} & \multicolumn{3}{p{15em} }{Bahadur slopes for Kruskal-Wallis} \\
\midrule
\multicolumn{1}{ l }{} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} & \multicolumn{1}{p{5em} }{Simple-vs-complex} & \multicolumn{1}{p{5em} }{Simple-vs-multi} & \multicolumn{1}{p{5em} }{Complex-vs-multi} \\
\midrule
R & 4.37E-01 & 3.92E-01 & 4.37E-01 & 6.24E-04 & 2.52E-03 & 4.82E-05 \\
\midrule
K & 3.83E-01 & 6.93E-01 & 3.83E-01 & 7.62E-04 & 2.88E-03 & 2.13E-05 \\
\midrule
D & 4.49E-01 & 1.81E-01 & 4.49E-01 & 1.90E-03 & 1.06E-02 & 1.42E-04 \\
\midrule
E & 7.64E-01 & 1.94E-01 & 7.64E-01 & 4.71E-04 & 9.05E-03 & 1.26E-04 \\
\midrule
Y & 8.32E-01 & 3.36E-01 & 8.32E-01 & 3.09E-04 & 9.63E-04 & 5.15E-05 \\
\midrule
W & 7.25E-01 & 1.36E-01 & 7.25E-01 & 6.53E-04 & 5.44E-03 & 1.52E-04 \\
\midrule
L & 7.15E-01 & 7.95E-01 & 7.15E-01 & 7.90E-05 & 3.41E-04 & 2.90E-06 \\
\midrule
C & 8.47E-01 & 9.54E-01 & 8.47E-01 & 3.05E-04 & 4.26E-05 & 5.06E-06 \\
\midrule
R+K & 2.89E-01 & 5.13E-01 & 2.89E-01 & 4.79E-04 & 1.41E-03 & 1.82E-05 \\
\midrule
D+E & 4.94E-01 & 2.07E-01 & 4.94E-01 & 7.11E-04 & 4.14E-03 & 6.29E-05 \\
\bottomrule
\end{tabular}%
}%
\label{table:expallbahadur}
\end{table}%
Many low p\--values in Tables~\ref{table:unihumanbahadur} and~\ref{table:expallbahadur} indicate significant differences between the three distributions studied.
For the UniHuman dataset (Table~\ref{table:unihumanbahadur}), the most striking significant differences are between the charged residue distributions (R, K, D, E) of simple and complex single\--pass~\gls{tmh}+flank regions (\({\chi}^{2}\) p\--value$<$2.23e-3 for single amino acid types).
Similarly, simple single\--pass~\gls{tmh}+flank segments differ significantly from multi\--pass~\gls{tmh}+flank segments (\gls{kw} test p\--values$<$3e-2 for the R, K, D, E, Y and W amino acid types as well as for K+R and D+E).
The trends are the same for the ExpAll dataset (Table~\ref{table:expallbahadur}): simple and complex single\--pass~\gls{tmh}+flank regions differ in their charged amino acid type distributions (\({\chi}^{2}\) p\--value$<$4.21e-3 for all cases), as do simple single\--pass and multi\--pass ones (\gls{kw} test p\--values$<$5e-2 for the R, D, E, Y and W amino acid types and for D+E).
Whereas p\--value tests for significant differences between distributions depend strongly on the amount of data, the more informative Bahadur slopes, which measure the distance from the null hypothesis, are independent of the amount of data~\cite{Bahadur1967, Bahadur1971, Sunyaev1998}.
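To make the contrast with p\--values concrete, the Bahadur slope can be sketched as follows.
This is the standard textbook form of Bahadur's exact slope, given here under the assumption that the tabulated values follow it up to a constant factor and sign; the symbols $c$, $n$ and $p_n$ are introduced only for this illustration.
\[
c = \lim_{n \to \infty} -\frac{2}{n} \ln p_n
\]
Here $p_n$ is the p\--value attained by a test with $n$ observations.
For a fixed true difference between two distributions, $\ln p_n$ decreases roughly linearly with $n$, so the ratio $\ln p_n / n$ approaches a constant; this is why the slope, unlike the p\--value itself, does not scale with dataset size.
With finite data, the limit can only be approximated, for example by $|\ln p| / n$ at the available $n$.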
As we can see in Tables~\ref{table:unihumanbahadur} and~\ref{table:expallbahadur}, the absolute Bahadur slopes for the simple single\--pass to multi\--pass comparison are always larger than those for the complex single\--pass to multi\--pass comparison (even by at least an order of magnitude): (i) for all three statistical tests applied (\({\chi}^{2}\),~\gls{ks} and~\gls{kw}), (ii) for all amino acid types as well as for K+R and E+D, and (iii) for both datasets, UniHuman and ExpAll.
Thus, complex single\--pass~\gls{tmh}+flanks have compositional properties that are indeed very similar to those of multi\--pass ones (which are known to have a large fraction of complex~\gls{tmh}s~\cite{Wong2011, Wong2012}).
This strong evidence implies that the relevant distinction is not so much between single- and multi\--pass~\gls{tmh} segments as between simple and complex~\gls{tmh}s, where the former are exclusively guided by the anchor requirements whereas the latter have more complex restraints to fulfil.
Several distribution features that contribute to the statistical differences between simple~\gls{tmh}s from single\--pass proteins on the one hand, and complex~\gls{tmh}s from single\--pass proteins and~\gls{tmh}s from multi\--pass proteins on the other (Figure~\ref{fig:complexity_datasets}), are especially notable.
There is a more pronounced trend for positively\--charged residues and tyrosine to be preferentially located on the inside flanks and for negatively\--charged residues to be on the outside flanks.
The symmetrical peaks in the percentage distribution of tyrosine in complex single\--pass~\gls{tmh}s are more akin to multi\--pass~\gls{tmh}s, whereas in simple~\gls{tmh}s the distribution resembles a more typical single\--pass helix (compare with Figure~\ref{fig:single_pass_charge_distribution}).
Furthermore, the depression of charged residues within the~\gls{tmh} itself is strongest in simple single\--pass~\gls{tmh}s.
To emphasise, tryptophan is essentially not tolerated within the simple~\gls{tmh}s and there are higher peaks of tryptophan occurrence at either flank.
We also see a strong inside skew for leucine clustering within the core of simple~\gls{tmh}s, which is not present in the ``flatter'' distributions of complex single\--pass~\gls{tmh}s and~\gls{tmh}s from multi\--pass proteins.
There is a clear cysteine-inside preference for simple, single\--pass~\gls{tmh}s, but less so in complex, multi\--pass~\gls{tmh}s (Figure~\ref{fig:complexity_datasets}).
This conclusion is contrary to a previous study~\cite{Nakashima1992}, but that deduction was drawn from a much smaller dataset of 45 single\--pass~\gls{tmh}s and 24 multi\--pass~\gls{tmp}s.
\section{Discussion}
\subsection{The ``negative-outside/non-negative-inside'' skew in~\gls{tmh}s and their flanks is statistically significant}
We have seen that, consistently throughout the datasets, there is a trend for generally rare negatively\--charged residues to prefer the outside flank of a~\gls{tmh} rather than the inside (and to almost completely avoid the~\gls{tmh} itself); be it by suppression on the inside and/or enrichment on the outside.
The trend is much stronger in single\--pass protein datasets than in multi\--pass protein datasets.
However, as we elaborated further, the real crux of the bias appears to be whether the~\gls{tmh} is simple or complex~\cite{Wong2011, Wong2012}, that is, whether or not the~\gls{tmh} has a role beyond anchorage.
The existence of this bias has implications for topology prediction of proteins with~\gls{tmh}s and for engineering membrane proteins, as well as for models of protein transport via membranes and protein-membrane stability considerations.
It should be noted that the controversy in the scientific community about the existence of a negative\--charge bias at~\gls{tmh}s was mainly with regard to multi\--pass~\gls{tmp}s.
Despite having access to much larger, better annotated sequence datasets and many more 3D structures than our predecessors, we also had our share of difficulties here (see Results Section~\ref{section:negativeskewmultipass} and Table~\ref{table:multipassstats}).
The straightforward approach results in inconclusive statistical tests if datasets become small (for example, if selections are restricted to subcellular localisations, 3D structures or if very harsh sequence redundancy criteria are applied) and, especially, if~\gls{tmh}s with very short or no flanks are included.
Therefore, in the case of multi\--pass proteins, we studied flanks as taken from the TM boundaries in the databases under several conditions: (i) without allowing flank overlap between neighbouring~\gls{tmh}s, (ii) as a subset of (i) but requiring some minimal flank length at either side, and (iii) with overlapping flanks.
We also studied flanks after central alignment of~\gls{tmh}s and assuming standardised~\gls{tmh} length.
Multi\--pass~\gls{tmh}s (without overlapping flanks) do not show a statistically significant negative\--charge bias under condition (i), but this is apparently due to the many~\gls{tmh}s that have no flank, or only a very short flank, on at least one side.
Significance appears as soon as subsets of~\gls{tmh}s with flanks at both sides are studied.
Not surprisingly, there is no charge bias if there are no flanks in the first place.
It is perhaps worth noting that the results from multi\--pass~\gls{tmh}s with overlapping flanks may involve amplification of skews since it involves multiple counting of the same residues.
Given the redundancy threshold of UniRef90, we cannot rule out that these statistical skews are the result of a trend from only a small sub-group of~\gls{tmp}s which is being amplified.
Hence, we also needed to check whether the same biases hold under condition (ii), which is indeed the case.
As the ``negative-outside/negative-not-inside'' skew is observed with statistical significance across varying taxa and subcellular localisations, it appears, at least to a certain extent, to be caused by physical reasons and to be associated with the background membrane potential.
Several earlier considerations and observations support this thought: (i) The negative and positive charges on the~\gls{tmh} flanks act in concert to drive anchorage and the direction of insertion of engineered~\gls{tmh}s~\cite{Sipos1993, Hartmann1989}.
(ii) The inner leaflet of the plasmalemma tends to be more negatively\--charged~\cite{Zachowski1993}.
Specifically, phosphatidylserine was found to distribute in the cytosolic leaflet of the plasma membrane and to interact electrostatically with moderately positively\--charged proteins strongly enough to redirect them into the endocytic pathway~\cite{Yeung2008}.
The negative charge of proteins at the inside of the plasma-membrane would decrease the anchoring potency of the~\gls{tmh} via electrostatic repulsion.
(iii) In membranes that maintain a membrane potential, electrical forces inevitably act on charged residues during chain translocation, which influences how the translocon machinery orients the~\gls{tmh}.
Therefore, it is no surprise that we see an inside-outside bias for negatively\--charged residues that is opposite to the one for positively\--charged residues.
The negative charges in~\gls{tmh} residues have been shown to experience an electrical pulling force as they pass through the bacterial SecYEG translocon import~\cite{Ismail2012, Ismail2015}.
Also, they are known to be involved in intra-membrane helix-helix interactions~\cite{Meindl-Beinker2006}.
For example, aspartic acid and glutamic acid can drive efficient di- or trimerisation of~\gls{tmh}s in lipid bilayers and, furthermore, aspartic acid interactions with neighbouring~\gls{tmh}s can directly increase the insertion efficiency of marginally hydrophobic~\gls{tmh}s via the Sec61 translocon~\cite{Meindl-Beinker2006}.
In support of this, fewer acidic residues are found in single\--pass~\gls{tmh}s, among which only some will undergo intra-membrane helix-helix interactions.
As mutation studies have shown negative charge to be a topological determinant~\cite{Nilsson1990}, it is perhaps no surprise that we observe a skew in negatively\--charged residues in a similar manner to the skew in positively\--charged residues.
Whereas the ``negative-outside/negative-not-inside'' skew is observed for distantly related eukaryotic species and is also present in Gram-negative bacteria such as \textit{E.~coli}, this sequence pattern was not observed for Gram-positive bacteria, in which there is no observable bias.
In contrast, Archaea have a statistically significant ``negative-inside'' propensity both for single- and multi\--pass~\gls{tmp}s.
It is known that Archaea have remarkably different membranes compared to other kingdoms of life due to their extremophile adaptations to stress~\cite{Oger2013}.
Whilst it is unclear why negative charge is distributed so differently in UniArch compared with the other taxonomic datasets, a much more nuanced approach would be needed to draw formal conclusions about Archaea, which current databases cannot support due to the relatively limited information on and annotation of archaeal proteomes.
\subsection{Methodological issues made previous studies struggle to identify negatively\--charged skews with statistical significance}
Whereas the influence of a negative\--charge bias in engineered proteins with TM regions on the direction of insertion into the membrane was solidly established~\cite{Nilsson1990, Andersson1993, Kim1994, Andersson1992, Rutz1999}, the search for the negative charge distribution pattern in the statistics of sequences of TM proteins from databases failed to find significance for the expected negative charge skew~\cite{Sharpe2010, Baeza-Delgado2013, Granseth2005, Pogozheva2013, Nilsson2005a, Andersson1992}.
Generally speaking, the datasets from previous studies have been considerably smaller compared with those in our work (only Sharpe \textit{et al.} had a similar order of magnitude~\cite{Sharpe2010}), especially those with experimental information about 3D structure and membrane topology that we used for validation.
Previous studies might also not have had the luxury of using UniProt's improved TRANSMEM consensus annotation based on a multitude of TM prediction methods and experimental data, but this is not the major issue either.
We found that there are other factors that are critical for observing sequence biases such as the negative charge skew in the case of~\gls{tmh}s.
\begin{enumerate}[i]
\item Acidic residues are rare near and within~\gls{tmh}s, and biases in their distribution are easily blurred by minor fluctuations of much more frequent amino acid types, most notably leucine.
Therefore, the method of normalisation is critical.
We have shown that normalising by the total number of residues of the amino acid type studied within the sequence region under consideration (called ``relative percentage'' in this work) is appropriate to answer the question of where a negatively\--charged residue is found, if there is one at all.
\item The alignment of the~\gls{tmh}s is critical.
It was common practice to align~\gls{tmh}s according to the most cytosolic residue~\cite{Sharpe2010}, although it is known that the membrane/cytosol boundary of the~\gls{tmh} is not well defined (and the exact boundary is even less well understood at the non-cytosolic side).
Aligning the TM regions and their flanks from the centre of the~\gls{tmh} was first proposed by Baeza-Delgado \textit{et al.}~\cite{Baeza-Delgado2013}.
Since we know now that acidic residues are often suppressed in the cytosolic flank and within the~\gls{tmh}, this implies that the few acidic residues found in the cytosolic interface would appear more comparable to those in the poorly defined non-cytosolic interface as the respective residues are spread over more potential positions, diminishing any observable bias.
\item We find that separation into single- and multi\--pass~\gls{tm} datasets (or, even better, simple and complex~\gls{tmh}s~\cite{Wong2011, Wong2012}) is critical to study the inside/outside bias.
As many~\gls{tmh}s in multi\--pass~\gls{tmp}s have essentially no flanks or very short flanks if the condition of non-overlap is applied to flanks of neighbouring~\gls{tmh}s, this might also obscure the observation of the negative\--charge bias.
If there are no flanks, then there will be no residue distribution bias in these flanks.
The problem can be alleviated by either studying only subsets with minimal flank lengths on both sides (although datasets might become too small for statistical analysis) or by allowing flank overlaps between neighbouring~\gls{tmh}s.
\item This classification is even more justified in the light of previous reports about the ``missing hydrophobicity'' in multi\--pass~\gls{tmh}s~\cite{Nilsson1990, Hedin2010, Hessa2007, Ojemalm2012}.
Otherwise, the distribution bias well observed among the exclusive anchors could be lost to noise.
This addresses the more biologically contextualised issue that there are different evolutionary pressures on different types of~\gls{tmh}s.
The negative charge skew is most pronounced for dedicated anchors frequently found with simple~\gls{tmh}s typically observed in single\--pass TM proteins.
These~\gls{tmh}s are pressured to exhibit residue biases that may aid anchorage in a topologically correct manner.
Complex~\gls{tmh}s, typically within multi\--pass membrane proteins that have a function beyond anchorage, comply with a multitude of structural and functional constraints, and the negative charge skew is just one of them.
\end{enumerate}
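The ``relative percentage'' normalisation described in the first item above can be written compactly.
The following is only a sketch in notation introduced for this purpose; the symbols $n_a(i)$, $r_a(i)$ and the position set $P$ are assumptions and are not used elsewhere in this work.
\[
r_a(i) = 100 \times \frac{n_a(i)}{\sum_{j \in P} n_a(j)}
\]
Here $n_a(i)$ is the number of residues of amino acid type $a$ observed at aligned position $i$ across all~\gls{tmh}+flank segments of a dataset, and $P$ is the set of all aligned positions in the sequence region under consideration.
Because the denominator counts only residues of type $a$, the values $r_a(i)$ sum to 100 over $P$ for each amino acid type, so the profile of a rare type such as aspartic acid is unaffected by fluctuations in the counts of abundant types such as leucine.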
The most representative preceding papers are those of Sharpe \textit{et al.}~\cite{Sharpe2010} from 2010 (with 1192 human and 1119 yeast single\--pass~\gls{tmh}s), Baeza-Delgado \textit{et al.}~\cite{Baeza-Delgado2013} (with 792~\gls{tmh}s mixed from single- and multi\--pass~\gls{tmp}s) and Pogozheva \textit{et al.}~\cite{Pogozheva2013} (with~\gls{tmh}s from 191 mixed single- and multi\--pass~\gls{tmp}s with structural information), the latter two both from 2013.
Whereas the first analysis would have benefited from the central alignment approach and the first two studies from another normalisation as described above, the third study did come close to our findings.
Notably, their dataset, which mixed single- and multi\--pass proteins, was too small to reveal the negative\--charge bias with significance; yet, they observed total charge differences on either side of the membrane for both single- and multi\--pass proteins.
Membrane asymmetry due to positively\--charged residues occurring more frequently on the cytosolic side causes net charge unevenness at both sides of the membrane.
This observation has been known to correlate with orientation for decades~\cite{VonHeijne1989, Baeza-Delgado2013, Meindl-Beinker2006}.
Our data shows that the negative charge skew contributes to this asymmetry.
\subsection{There are differences in charged amino acid residue biases in~\gls{tmh} flanks through each stage of the secretory pathway}
Here, we observe differences throughout sub-cellular locations along the secretory pathway.
We found that negative charges are enriched at the outside flank in the~\gls{er}, both enriched outside and suppressed inside in the Golgi membrane, and suppressed on the inside flank in the~\gls{pm}.
It has been suggested that the leaflets of different membranes have different lipid compositions throughout the secretory pathway~\cite{VanMeer2008} and this has led to general biochemical conservation in terms of~\gls{tmh} length and amino acid composition in different membranes~\cite{Sharpe2010, Pogozheva2013}.
However, only the organelles with the most annotated protein records were used here.
Further investigation into the \gls{tmp}s of lysosomes, endosomes, and other \gls{er}\--Golgi transition vesicles would yield more information on this.
Furthermore, \gls{tmp}s that are not associated with the signal peptide \gls{tmp}s destined for the secretory pathway could be studied, such as those embedded in the membranes of the mitochondrion, apicoplast, chromoplast, chloroplast, cyanelle, thylakoid, amyloplast, peroxisome, glyoxysome, and hydrogenosome.
Lipid asymmetry in the Golgi and~\gls{pm} (in contrast to the~\gls{er}) has been known about for over a decade~\cite{Daleke2007, Devaux2004}.
To note, the Golgi and~\gls{pm} have lipid asymmetry with sphingomyelin and glycosphingolipids on the non-cytosolic leaflet, and phosphatidylserine and phosphatidylethanolamine enriched in the cytosolic leaflet.
Although the~\gls{er} is the main site for cholesterol synthesis, it has markedly low concentrations of sphingolipids~\cite{Bell1981}.
The Golgi synthesises sphingomyelin, a lipid not present in the~\gls{er} but present in both the Golgi~\cite{Futerman2005} and the~\gls{pm}~\cite{Li2007, Tafesse2007}.
The~\gls{pm} is also enriched with densely packed sphingolipids and sterols~\cite{Paolo2006}.
Another factor influencing the sequence patterns of~\gls{tmh}s and their flanks along the secretory pathway appears to be the variation in membrane potentials~\cite{Qin2011, Worley1994, Schapiro2000}.
\subsection{Several sequence features can be assigned to anchor~\gls{tmh}s: charged-residue flank biases, leucine intra-helix asymmetry, and the ``aromatic belt''}
We investigated the difference between~\gls{tmh}s from single\--pass and multi\--pass proteins and found significant differences in sequence composition that are reflective of the biologically different roles the~\gls{tmh}s play.
To emphasise and validate these findings, we separated~\gls{tmh}s from single\--pass proteins into simple and complex~\gls{tmh}s~\cite{Wong2011, Wong2012}: one set that likely contains mostly~\gls{tmh}s that act as exclusive anchors, and another whose~\gls{tmh}s have roles beyond anchorage.
This leaves us with ``anchors'' (simple~\gls{tmh}s from single\--pass proteins) and ``non-anchors'' (complex~\gls{tmh}s from single\--pass proteins, and~\gls{tmh}s from multi\--pass proteins).
If there are strong sequence feature differences between anchors and non-anchors, it is likely that the sequence feature has a role in satisfying membrane constraints to act as an energetically optimally stable anchor.
Future studies in this area should ideally include a comprehensive analysis of datasets of oligomerised~\gls{tmh}s from single\--pass proteins and ascertain whether they appear more similar to simple anchors, to multi\--pass~\gls{tmh}s, or to neither.
Currently, no sufficiently complete set of intra-membrane oligomerised single\--pass proteins exists that can be compared to a large set of known non-oligomerising proteins.
The current work sidesteps this issue by comparing single\--pass proteins with simple~\gls{tmh}s, which tend to be simple anchors (as shown in previous work~\cite{Wong2011, Wong2012}), against datasets that contain~\gls{tmh}s that will form intra-membrane bundles.
Bluntly, the simple/complex status of a~\gls{tmh} can be easily computed from its sequence with TMSOC whereas the oligomerisation state of most membrane proteins still needs to be experimentally determined.
Unsurprisingly, both positively\--charged and negatively\--charged residues show a stronger distribution bias in anchors than in non-anchors.
Both the ``positive-inside'' rule as well as the ``negative-outside/non-negative-inside'' bias are mostly observable in simple single\--pass~\gls{tmh}s (although they are statistically significant elsewhere).
It is perhaps true that where a bias is clearly present in both non-anchors and anchors alike, it is a strong topological determinant, whereas if the residue is only distributed with topological bias in exclusively anchoring~\gls{tmh}s, we can attribute these features more specifically to biophysical anchorage.
This being said, we should not rule out that the same features aid topological determination since negative charge has been shown to be a weaker topological determinant than positively\--charged residues (35).
Tyrosine and tryptophan residues are commonly found at the interfacial boundaries of the~\gls{tmh}, a feature called the ``aromatic belt''~\cite{Sharpe2010, Baeza-Delgado2013, Granseth2005, Nilsson2005a, Hessa2005}, which was thought to be caused by their affinity for the carbonyl groups in the lipid bilayer~\cite{Killian2000}.
Not all types of aromatic residues are found in the aromatic belt; phenylalanine has no particular preference for this region~\cite{Granseth2005, Braun1999}.
It is still unclear if the aromatic belt has to do with anchorage or with translocon recognition~\cite{Baeza-Delgado2013}.
Here,~\gls{tmh}s with exclusively anchorage functions showed stronger preferences for W and Y in the aromatic belt region (otherwise known as the water-lipid interface region) than~\gls{tmh}s with functions beyond anchorage.
This is strong evidence that the aromatic belt indeed assists with anchorage, and is less conserved where the~\gls{tmh} must conform to other restraints beyond membrane anchorage.
Furthermore, we see that tyrosine's preference for the inside interface region also appears to be related to anchorage, and this trend holds to some extent for tryptophan, too.
Finally, our findings corroborate earlier reports that many multi\--pass~\gls{tmh}s are much less hydrophobic than typical single\--pass~\gls{tmh}s and that about 30\% of them fail the hydrophobicity requirements of $\Delta$G~\gls{tmh} insertion prediction (``missing hydrophobicity'')~\cite{Hessa2005, Hedin2010, Hessa2007, Ojemalm2012}.
We also find that the leucine skew and the hydrophobic asymmetry towards the cytosolic leaflet of the membrane are more pronounced in simple, single\--pass~\gls{tmh}s than in complex or multi\--pass ones; thus, this appears to be another anchoring feature.
It was found previously that~\gls{tmh}s of multi\--pass proteins share similar hydrophobicity profiles on average, irrespective of the number of~\gls{tmh}s, and that~\gls{tmh}s from single\--pass proteins are typically more hydrophobic than~\gls{tmh}s from multi\--pass proteins~\cite{Wong2011}.
Sharpe \textit{et al.}~\cite{Sharpe2010} report an asymmetric hydrophobic length for single\--pass~\gls{tmh}s.
Our study reiterates the hydrophobic asymmetry and attributes it mainly to the leucine distribution.
The leucine asymmetry might be linked to the different lipid composition of either leaflet of biological membranes.
\begin{figure}[!ht]
\centering
\includegraphics[width=1\textwidth]{NNI_chapter/overview}
\captionof{figure}[Residue distributions of transmembrane anchors.
A view showing additional residue distribution features that transmembrane helices with an anchorage function display.]{\textbf{Residue distributions of transmembrane anchors.