---
title: "Neotoma Paleoecology Manual v2.0"
description: "The manual for the Neotoma Paleoecology Database, including SQL and R code examples."
author: "Simon Goring"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
documentclass: book
bibliography: [assets/references.bib]
biblio-style: apalike
link-citations: yes
delete_merged_file: true
always_allow_html: true
graphics: yes
url: https://neotomadb.org
output:
  bookdown::gitbook:
    css: ./style.css
    config:
      toc:
        max-depth: 4
      edit: null
      download: null
  bookdown::pdf_book:
    latex_engine: xelatex
    citation_package: natbib
    keep_tex: yes
  bookdown::epub_book: default
---
# Acknowledgements
The documentation for the [Neotoma Paleoecology Database](https://neotomadb.org) would not be possible without the extraordinary work of Dr. Eric C. Grimm [@Jacobson2021], who spent countless hours developing the original database manual and was the center of an incredible community built around the database. Neotoma rests on the work of a number of researchers who contributed to the original North American Pollen Database, and on subsequent data contributors, including FAUNMAP contributors and the data contributions of Allan Ashworth. The Neotoma Database would not exist were it not for the ongoing contributions of authors, data analysts, and funding agencies, in particular the U.S. National Science Foundation. This manual draws heavily from Eric Grimm's original Neotoma manual (v2), published as [@grimm2008neotoma].
The Postgres snapshot of the database is accessible from [the Neotoma Snapshots page](http://www.neotomadb.org/snapshots). For users who may be interested in loading the database using Docker, a GitHub repository is available to install the latest snapshot and build a container locally.
```{r, packageload, echo=FALSE, message=FALSE, warning=FALSE, results='hide', include=FALSE}
Sys.setenv(OPENSSL_CONF = "/dev/null")
if (!require("pacman")) install.packages("pacman")
suppressPackageStartupMessages(library(pacman))
packages <- read.delim('requirements.txt')
p_load(char = unlist(packages))
dbLogin <- jsonlite::fromJSON('connect_remote.json')
db <- DBI::dbConnect(RPostgres::Postgres(),
                     dbname = dbLogin$database,
                     host = dbLogin$host,
                     port = dbLogin$port,
                     user = dbLogin$user,
                     password = dbLogin$password)

showTable <- function(db, tablename) {
  # Return the column names, data types, and constraints for a single table
  # in the `ndb` schema, rendered as an interactive DT table.
  query <- "SELECT
      atr.attname AS column,
      pg_catalog.format_type(atr.atttypid, atr.atttypmod) AS data_type,
      array_to_string(array_remove(array_agg(DISTINCT con.conname), NULL), ', ') AS constraint
    FROM pg_catalog.pg_attribute AS atr
    INNER JOIN pg_catalog.pg_class AS cl ON cl.oid = atr.attrelid
    INNER JOIN pg_catalog.pg_namespace AS nsp ON nsp.oid = cl.relnamespace
    LEFT JOIN pg_catalog.pg_constraint AS con
      ON con.conrelid = atr.attrelid AND atr.attnum = ANY(con.conkey)
    WHERE atr.attnum > 0
      AND NOT atr.attisdropped
      AND nsp.nspname = 'ndb'
      AND cl.relname = $1
    GROUP BY atr.attname, atr.atttypid, atr.atttypmod, atr.attnum
    ORDER BY atr.attnum;"
  output <- DBI::dbGetQuery(db, query, params = list(tablename))
  DT::datatable(output, filter = 'none', rownames = FALSE, options = list(dom = 'ltip'))
}
```
<!--chapter:end:index.Rmd-->
# Introduction
The Neotoma Paleoecology Database is a public, community-curated database containing fossil data from the Holocene, Pleistocene, and Pliocene, or approximately the last 5.3 million years [@Williams2018a]. Neotoma stores biological data and associated physical data from fossil-bearing deposits or the depositional environments from which datasets have been obtained: for example, sediment loss-on-ignition and geochemical data from lake sediments, or modern water chemistry data from water bodies from which diatoms have been collected. The database also stores data from modern samples that are used to interpret fossil data.
The initial development of Neotoma was funded by a grant from the U.S. National Science Foundation Geoinformatics program. The initial grant was a collaborative proposal between Penn State University [@nsf0622349] and the Illinois State Museum [@nsf0622289]. It had five Principal Investigators: Russell W. Graham, Eric C. Grimm, Stephen T. Jackson, Allan C. Ashworth, and John W. (Jack) Williams.
Initially, data within Neotoma were merged from four existing databases: the Global Pollen Database, [FAUNMAP](#FAUNMAP), a database of mammalian fauna [@Group1994], the North American Plant Macrofossil Database, and a [fossil beetle database](#BEETLE) assembled by Allan Ashworth. Although structurally different, these databases contain similar kinds of data, and merging them was quite practical. The rationale for this merging was twofold:
1. To facilitate analyses of past biotic communities at the ecosystem level
2. To reduce the overhead in maintaining and distributing several independent databases.
Because the proxy types that Neotoma integrated were sufficiently diverse, the data model had to focus on their commonalities as the core of the database structure, specifically elements of stratigraphy and chronology. This design facilitated the gradual inclusion of other database types, including the addition of ostracode, diatom, chironomid, and freshwater mussel datasets.
The Neotoma database was initially designed by Eric C. Grimm and implemented in Microsoft® Access®. Neotoma was then ported to SQL Server, where it was served from the [Center for Environmental Informatics](https://sites.psu.edu/environmentalinformatics/) at Penn State University. Subsequently the database was ported to PostgreSQL, to support a fully open data ecosystem with an Application Programming Interface (API) that could be integrated into R packages [e.g., @goring2015neotoma] or other programming languages.
## Whence Neotoma
| ![Image of a packrat](assets/images/Neotoma_magister.jpg) |
|:--:|
| A packrat of the genus *Neotoma*. Credit: Alan Cressler, CC BY-SA 2.0 <https://creativecommons.org/licenses/by-sa/2.0>, via Wikimedia Commons |
Neotoma was called a "Late Neogene Terrestrial Ecosystem Database" in the [original NSF proposal](https://nsf.gov/awardsearch/showAward?AWD_ID=0622289). At the time this proposal was written, the Neogene Period included the Miocene, Pliocene, Pleistocene, and Holocene epochs. However, a proposal before the International Commission on Stratigraphy elevated the Quaternary to a System or Period following the Neogene [@gibbard2010formal], terminating the Neogene at the end of the Pliocene. To account for the change in nomenclature, numerous names and companion acronyms were considered, but none engendered enthusiastic support. B. Brandon Curry proposed the name **Neotoma**, and this name struck a fancy. *Neotoma* is the [genus for the packrat](https://en.wikipedia.org/wiki/Pack_rat). Packrats are prodigious collectors of anything in their territory, and, moreover, they are collectors of fossil data. **Neotoma** packrats collect plant macrofossils and bones, and pollen is preserved in their amberat -- hardened, dried urine, which impregnates their middens and preserves them for millennia.
## Rationale
Paleobiological data from the recent geological past have been invaluable for understanding ecological dynamics at timescales inaccessible to direct observation, including ecosystem evolution, contemporary patterns of biodiversity, principles of ecosystem organization, particularly the individualistic response of species to environmental gradients, and the biotic response to climatic change, both gradual and abrupt. Understanding the dynamics of ecological systems requires ecological time series, but many ecological processes operate too slowly to be amenable to experimentation or direct observation. In addition to having ecological significance, fossil data have tremendous importance for climatology and global change research. Fossil floral and faunal data are crucial for climate-model verification and are essential for elucidating climate-vegetation interactions that may partly control climate.
Basic paleobiological research is site based, and paleobiologists have devoted innumerable hours to identifying, counting, and cataloging fossils from cores, sections, and excavations. These data are typically published in papers describing single sites or small numbers of sites. Often, the data are published graphically, as in a pollen diagram, and the actual data reside on the investigator's computer or in a file cabinet. These basic data are similar to museum collections: costly to replace, sometimes irreplaceable, and their value does not diminish with time. Also similar to museum collections, the data require cataloging and curation. Whereas physical specimens of large fossils, such as animal bones, are typically accessioned into museums, microfossils, such as pollen, are not accessioned; the digital data are the primary objects, and their loss is equivalent to losing valuable museum specimens. The integrated database ensures safe, long-term archiving of these data.
Large independent databases exist for fossil pollen, plant macrofossils, and mammals: the Global Pollen Database (GPD), the North American Plant Macrofossil Database (NAPMD), and FAUNMAP. In addition, a database of fossil beetles (BEETLE) has been assembled and integrated into Neotoma. These databases, as with others in the Earth and ecosystem sciences, have become essential cyberinfrastructure. Nevertheless, these resources were originally developed as standalone databases in the early 1990's. GPD and NAPMD were stored in [Paradox®](https://en.wikipedia.org/wiki/Paradox_(database)) file formats; FAUNMAP in [Microsoft Access](https://en.wikipedia.org/wiki/Microsoft_Access). Since initial database development, emphasis has been placed on ingest of new and legacy data. However, database and Internet technology have advanced greatly since 1995, and the current relational database software, ingest programs, data retrieval algorithms, output formats, and analysis tools are outdated and minimal. Moreover, the databases are not linked, so that integrated analyses are difficult.
Although GPD, NAPMD, and FAUNMAP were developed independently, they have much in common. The basic data of all three databases, as well as BEETLE, are essentially lists of taxa from cores, excavations, or sections, often with quantitative measures of abundance. The three databases include similar metadata. The objective of Neotoma is to build a unified data structure that will incorporate all of these databases. The database will initially incorporate pollen, plant macrofossil, mammal, and beetle data. However, the database design facilitates the incorporation of all kinds of fossil data.
Various teams of investigators have developed databases for paleobiological data that have been project or discipline based, including the four databases to be integrated in this project. However, long-term maintenance and sustainability have been problematic because of the need to secure continuous funding. Nevertheless, these databases have become the established archives for their disciplines, and new data are continuously contributed. However, because of funding hiatuses, long spells may intervene between times of data contribution and their public availability. For example, the plant macrofossil database has not incorporated any new data since 1999. The number of different databases and disciplines exacerbates the problem, because each database requires a database manager. Consolidation of informatics technology helps address this overhead issue. However, specialists are still essential for the management and supervision of data collection and quality control for their disciplines or organismal groups.
The purposes of Neotoma are:
* to facilitate studies of ecosystem development and response to climate change
* to provide the historical context for understanding biodiversity dynamics, including genetic diversity
* to provide the data for climate-model validation
* to provide a safe, long-term, low-cost archive for a wide variety of paleobiological data.
Site-based studies are invaluable in their own right, and they are the generators of new data. However, much is gained by marshalling data from geographic arrays of sites for synoptic, broad-scale ecosystem studies. In order to carry out such studies efficiently, a queryable database is required. Thus, it is much more than an archive; it is essential cyberinfrastructure for paleoenvironmental research. The database facilitates integration, synthesis, and understanding, and it promotes information sharing and collaboration. The individual databases have been extensively used for scientific research, with several hundred scientific publications directly based upon data drawn from these databases. This project will enhance those databases and will continue their public access. By integrating these databases and by simplifying the contributor interface, we can reduce the number of people necessary for community-wide database maintenance, and thereby help ensure their long-term sustainability and existence.
## History of the Constituent Databases
### Global Pollen Database
In an early effort, the Cooperative Holocene Mapping Project (COHMAP) [@Members1988;@Wright1993c] assembled pollen data in the 1970s and 1980s to test climate models. Although data-model comparison was the principal objective of the COHMAP project, the synoptic analyses of the pollen data, particularly maps showing the constantly shifting ranges of species in response to climate change, were revelatory and led to much ecological insight [e.g. @Webb1981;@Webb1987a;@Webb1988b].
The COHMAP pollen "database" was a set of flat files with a fixed file format for data and for chronologies. FORTRAN programs were written to read these files and to assemble data for particular analyses. Thompson Webb III managed the COHMAP pollen database at Brown University, but as the quantity of data increased, data management became increasingly cumbersome; the data needed to be migrated to a relational database management system. Discussions with E. C. Grimm led to the initiation of the North American Pollen Database (NAPD) in 1990.
At the same time in Europe, the International Geological Correlation Project IGCP 158 [@Church1989IGCPP1] was conducting a major collaborative synthesis of paleoecological data, primarily of pollen, making the need for a pollen database painfully obvious. In the foreword to the book resulting from this project [@berglund1996palaeoecological], J.L. de Beaulieu describes the role that the IGCP 158 project had in launching the European Pollen Database [@Huntley1993b]. A workshop to develop a European Pollen Database (EPD) was held, with North American representatives also attending, and the organizers of NAPD and EPD commenced a long-standing collaboration to develop compatible databases. NAPD and EPD held several joint workshops and developed the same data structure. Nevertheless, the two databases were independently established, partly because Internet capabilities were not yet sufficient to easily manage a merged database. The pollen databases were developed in Paradox, which at the time was the most powerful RDBMS readily available for the PC platform. NAPD and EPD established two important protocols:
* the databases were relational and queryable
* they were publicly available.
As the success of the NAPD-EPD partnership grew, working groups initiated pollen databases for other regions, including the Latin American Pollen Database (LAPD) in 1994, the Pollen Database for Siberia and the Russian Far East (PDSRFE) in 1995, and the African Pollen Database (APD) in 1996. At its initial organizational workshop, LAPD opted to merge with NAPD, rather than develop a standalone database, and the Global Pollen Database was born. PDSRFE also followed this model. APD developed independently, but uses the table structure of GPD and EPD. Pollen database projects have also been initiated in other regions, and the GPD contains some of these data, including the Indo-Pacific Pollen Database and the Japanese Pollen Database.
```{sql, ageSamples, connection=db, output.var="sampleroundcount", echo=FALSE}
SELECT ROUND(age::bigint, -2) AS age, COUNT(*) AS n
FROM ndb.sampleages
WHERE age IS NOT NULL
GROUP BY ROUND(age::bigint, -2);
```
```{r, agedistributions, echo=FALSE, fig.alt="Histogram of sample age distributions in Neotoma for the first 200,000 years of samples, showing that the data are heavily skewed toward the present.", fig.cap="Sample age distributions in Neotoma for a subset of Neotoma data. Ages are heavily skewed to samples in the last 10,000 years, and more so to the near-present.", warning=FALSE}
ggplot(sampleroundcount, aes(x = age, y = n)) +
  geom_col(width = 500) +
  scale_x_continuous() +
  theme_bw() +
  xlab('Years Before Present') +
  ylab('Sample Count')
```
The pollen databases contain data from the Holocene, Pleistocene, and Pliocene, although most data are from the last 20,000 years (Fig. \@ref(fig:agedistributions)). Included are fossil data, mainly from cores and sections, and modern surface samples, which are essential for calibrating fossil data. NAPD data are not separate from the GPD, but rather the NAPD is the North American subset of GPD. EPD has both public and restricted data --- a concession that had to be made early on to assuage some contributors.
### North American Plant Macrofossil Database
Plant macrofossils include plant organs generally visible to the naked eye, including seeds, fruits, leaves, needles, wood, bud scales, and megaspores. Synoptic-scale mapping of plant macrofossils from modern assemblages [@Jackson1997] and fossil assemblages [@Jackson1997;@Jackson2000c;@Jackson2002] have shown the utility of plant macrofossils in providing spatially and taxonomically precise reconstructions of past species ranges. Although plant macrofossil records are spatially precise, synoptic networks of high-quality sites can scale up to yield aggregate views of past distributions [@Jackson1997]. In addition, macrofossils, with their greater taxonomic resolution, augment the pollen data by providing information on which species might have been present, and can resolve issues of long-distance transport [@Birks2003].
The North American Plant Macrofossil Database (NAPMD) has been directed by S.T. Jackson. Highest priority has been placed on data from the last 30,000 years, although some earlier Pleistocene and late Pliocene data are included. The database originated as a research database for selected taxa from Late Quaternary sediments of eastern North America [@Jackson1997]. In 1994, an effort was initiated with NOAA funding to build on this foundation to develop a cooperative, relational database comprising all of North America, a longer time span, and all plant taxa.
The structure of NAPMD was adapted from the pollen database and was also stored in a Paradox file format. Although the plant macrofossil database was well served by the data model structure, modifications were made to accommodate different organs from the same species and to deal with the various quantitative measures of abundance. The Plant Macrofossil database also included surface samples, which were not part of the pollen databases at the time, but are useful for the interpretation of fossil data.
### FAUNMAP {#FAUNMAP}
R.W. Graham, E.L. Lundelius, Jr., and a group of Regional Collaborators organized a project to develop a database for late Quaternary faunal data from the United States, which the U.S. NSF funded in 1990. This project had a research agenda, and its seminal paper focused on the individualistic behavior displayed by animal species [@Group1994].
Two FAUNMAP databases exist, FAUNMAP I and FAUNMAP II. Both databases were coordinated by R. W. Graham and E. L. Lundelius, Jr. and funded by NSF. Both are relational databases for fossil mammal sites. The data were extracted from peer-reviewed literature, selected theses and dissertations, and selected contract reports for both paleontology and archaeology (all data is currently contained within the Neotoma publications tables). Unpublished collections were not included. Data were originally captured in Paradox but were later migrated to Microsoft Access.
FAUNMAP I contains data from sites in the continental United States (the lower 48 States) that date between 500 BP and \~40,000 BP. Funding for FAUNMAP I ended in 1994, with the production of two major publications by the FAUNMAP Working Group [@Group1994;@Group1996], along with publications from individual members and many others who accessed the database on-line. Graham and Lundelius continued the FAUNMAP project, developing FAUNMAP II with funding from NSF beginning in 1998. FAUNMAP II [@faunmapTwo] shares the same structure as FAUNMAP I but expands the spatial coverage to include Alaska and Canada and extends the temporal coverage to the Pliocene (5 Ma). In addition, sites published since 1994, when FAUNMAP I was completed, have been added for the contiguous 48 States. In all, FAUNMAP I and II contain more than 5000 fossil-mammal sites with more than 600 mammal species for all of North America north of Mexico that range in age from 0.5 ka to 5 Ma.
The detailed structure of the FAUNMAP database is described in FAUNMAP Working Group [@Group1994]. Sites identified by name and location were subdivided into Analysis Units (AUs), which varied from site to site depending upon the definitions used in the original publications (e.g., stratigraphic horizons, cultural horizons, excavation levels, biostratigraphic zones). All data (*i.e.*, taxa identified, and counts of individual specimens) and metadata (sediment types, depositional environments, facies, radiometric and other geochronological dates, modifications of bone) were associated with the individual AUs. This structure -- analysis units within sites -- allows for information to be extracted at the site level, or at the smallest sample subdivision. The analysis unit permits fine-scale temporal resolution and analysis. Similar to the GPD and NAPMD, FAUNMAP contains archival and research tables. Similar to the plant macrofossil database, FAUNMAP contains a variety of quantitative measures of abundance (*e.g.*, MNI, NISP), although presence data are more commonly used for analysis.
### BEETLE {#BEETLE}
Many beetles have highly specific ecological and climatic requirements and are valuable indicators of past environments [@morgan1983late;@ashworth2001aapg;@ashworth2004coleoptera]. Coleoptera is one of the most diverse groups of organisms on Earth and, of the insects, perhaps the most commonly preserved as fossils. Allan Ashworth has assembled a database of fossil beetles from North America. The data, which were recorded in Excel, contain 5523 individual records of 2567 taxa from 199 sites and 165 publications. Metadata include site name, latitude and longitude, lithology of sediment, absolute age, and geological age. The basic data are similar to those of the plant and mammal databases -- lists of taxa from sites. The metadata have not been recorded to the extent of the other databases, especially chronological data, but Ashworth has resolved the taxonomic issues and has assembled the publications, so that the additional metadata can be easily pulled together.
## Who Will Use Neotoma?
The existing databases have been used widely for a variety of studies. Because the databases have been available on-line, precise determination of how many publications have made use of them is difficult. In addition, the databases are widely used for instructional purposes. Below are examples of the kinds of people who have used these databases and who we expect will find the new, integrated database even more useful.
* **Paleoecologists** seeking to place a new record into a regional/continental/global context (e.g., Bell and Mead 1998, Czaplewski et al. 1999, Bell and Barnosky 2000, Newby et al. 2000, Futyma and Miller 2001, Gavin et al. 2001, Czaplewski et al. 2002, Schauffler and Jacobson 2002, Camill et al. 2003, Rosenberg et al. 2003, Willard et al. 2003, Pasenko and Schubert 2004, and many others).
* **Synoptic paleoecologists** interested in mapping regional to sub-continental to global patterns of vegetation change (e.g., Jackson et al. 1997, Williams et al. 1998, Jackson et al. 2000, Prentice et al. 2000, Thompson and Anderson 2000, Williams et al. 2000, Williams et al. 2001, Williams 2003, Webb et al. 2004, Williams et al. 2004, Asselin and Payette 2005).
* **Synoptic paleoclimatologists** building benchmark paleoclimatic reconstructions for GCM evaluation (e.g., Bartlein et al. 1998, Farrera et al. 1999, Guiot et al. 1999, Kohfeld and Harrison 2000, CAPE Project Members 2001, Kageyama et al. 2001, Kaplan et al. 2003).
* **Paleontologists** trying to understand the timing, patterns, and causes of extinction events (e.g., Jackson and Weng 1999, Graham 2001, Barnosky et al. 2004, Martínez-Meyer et al. 2004, Wroe et al. 2004).
* **Evolutionary biologists** mapping the genetic legacies of Quaternary climatic variations (e.g., Petit et al. 1997, Fedorov 1999, Tremblay and Schoen 1999, Hewitt 2000, Comps et al. 2001, Good and Sullivan 2001, Petit et al. 2002, Kropf et al. 2003, Lessa et al. 2003, Petit et al. 2003, Hewitt 2004, Lascoux et al. 2004, Petit et al. 2004, Whorley et al. 2004, Runck and Cook 2005).
* **Macroecologists** interested in temporal records of species turnover and biodiversity and historical controls on modern patterns of floristic diversity (e.g., Silvertown 1985, Qian and Ricklefs 2000, Brown et al. 2001, Haskell 2001).
* **Archeologists** who are studying human subsistence patterns and interactions with their environment (e.g., Grayson 2001, Grayson and Meltzer 2002, Cannon and Meltzer 2004, Grayson in press).
* **Natural resource managers** who need to know historical ranges and abundances of plants and animals for designing conservation and management plans (e.g., Graham and Graham 1994, Cole et al. 1998, Noss et al. 2000, Owen et al. 2000, Committee on Ungulate Management in Yellowstone National Park 2002, Burns et al. 2003)
* **Scientists** trying to understand the potential response of plants, animals, biomes, ecosystems, and biodiversity to global warming (e.g., Bartlein et al. 1997, Davis et al. 2000, Barnosky et al. 2003, Burns et al. 2003, Kaplan et al. 2003, Schmitz et al. 2003, Jackson and Williams 2004, Martínez-Meyer et al. 2004)
* **Teachers** who use the databases for teaching purposes and class exercises [@goring].
<!--chapter:end:chapters/01-neotoma_introduction.Rmd-->
# Working with the Raw Data
Neotoma is a Postgres database. The data are stored on a database server in the cloud, and most people interact with the database indirectly, either through the [Neotoma Explorer](https://apps.neotomadb.org), the [`neotoma2` R package](https://github.com/NeotomaDB/neotoma2), [Range Mapper](https://open.neotomadb.org/RangeMapper) or other tools. Much of this manual discusses the raw, underlying data that powers these tools. They all pull their data from the Neotoma API, an application that sends data from the database over the internet using specially constructed URLs.
An open API, like Neotoma's, is useful because all you need to access the data is an internet connection and the ability to understand JSON (either by scanning it visually, or by using a programming language like Python, R or JavaScript). For example:
```
https://api.neotomadb.org/v2.0/data/sites?sitename=Marion Lake
```
returns a JSON object that provides metadata about the sites in Neotoma that use the name "Marion Lake". More details about the API can be obtained from the online help for the API at https://api.neotomadb.org.
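If you want to work with the API response programmatically, the raw JSON can be parsed into R directly. The sketch below is a minimal example using the `jsonlite` package; the structure of the parsed response (a list with a `data` element) is an assumption here and is best confirmed by inspecting the object or the API documentation.

```{r, apiExample, eval=FALSE}
# A minimal sketch of calling the Neotoma API from R and parsing the JSON.
# URLencode() handles the space in the site name; the `data` element name
# is assumed from a typical response and should be checked with str().
library(jsonlite)
siteUrl <- URLencode("https://api.neotomadb.org/v2.0/data/sites?sitename=Marion Lake")
response <- jsonlite::fromJSON(siteUrl)
str(response$data)
```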
## Using the Database Locally
Users who wish to gain more experience working directly with SQL, or who need to undertake specialized analyses that are not supported by the R package or the available APIs, may choose to use the database directly. This involves installing PostgreSQL and associated add-ons (PostGIS in particular). Users should be aware that the database is not a desktop program of the kind they may be used to. The database runs in the background, and users will generally "connect" to the database from R, Python, or another programming language, or they may use a database tool such as pgAdmin or DBeaver. Postgres also comes with the command-line tool `psql`, with which a user can connect directly to the database from the terminal and type queries interactively. A minimal example of connecting from R follows the figure below.
![Using the `psql` commandline utility is one way of directly interacting with the database if you have a connection to a database server with Neotoma data loaded.](assets/using_psql.webm)
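The same connection can be made from R once a snapshot has been restored locally. This is a minimal sketch only; the database name, user, and password shown here are placeholders and should be replaced with the values used when the snapshot was loaded.

```{r, localConnect, eval=FALSE}
# Connect to a locally restored Neotoma snapshot. All connection details
# below are placeholders for illustration.
library(DBI)
library(RPostgres)
con <- DBI::dbConnect(RPostgres::Postgres(),
                      dbname = "neotoma",
                      host = "localhost",
                      port = 5432,
                      user = "postgres",
                      password = "postgres")
# A quick check that the connection works:
DBI::dbGetQuery(con, "SELECT COUNT(*) AS sites FROM ndb.sites;")
```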
<!--chapter:end:chapters/working_with_db.md-->
## Using R
The [`neotoma2` R package](https://github.com/NeotomaDB/neotoma2) provides a set of functions to download and work with data from Neotoma within the R programming environment. The package has been designed for users who wish to work with multiple sites or datasets, and supports users who wish to add their own data into Neotoma.
The functions in the R package act as *wrappers* for the API calls, and also provide some secondary services to help the data work efficiently in the R environment. For example, a user who wishes to search for all sites named 'Marion Lake' using R simply uses the `get_sites()` function:
```{r}
library(neotoma2)
marion <- get_sites(sitename = "Marion Lake")
marion
```
The following sections will detail several workflows using the R package. To help the end user we have created an instance of RStudio that is available from the browser, with all packages installed. You can access this version of RStudio from the [myBinder link](https://mybinder.org/v2/gh/NeotomaDB/Current_Workshop/main?urlpath=rstudio):
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/NeotomaDB/Current_Workshop/main?urlpath=rstudio)
While many of our examples are drawn directly from existing workshop materials, you may find more complete details about using the R package on the GitHub page for the package, or in one or more of our posted workshops:
* European Pollen Database Workshop (May 22, 2022): <https://open.neotomadb.org/EPD_binder/simple_workflow.html>
* AMQUA Pollen Workshop (May 22, 2022): <https://open.neotomadb.org/Workshops/AMQUA-June2022/simple_workflow.html>
* IAL/IPA Diatom Workshop (November 2022, in Spanish): <https://open.neotomadb.org/Workshops/IAL_IPA-November2022/simple_workflow_ES.html>
Other workshop materials are available within the [Neotoma Workshops GitHub repository](https://github.com/NeotomaDB/Workshops). Some workshops are highly specialized, and some contain links to cloud-based versions of RStudio so that users can work on the problems and workflows under standardized conditions.
<!--chapter:end:chapters/startingR.md-->
### Searching for Sites
There are several ways to find sites with the `neotoma2` R package. We think of `sites` as being primarily spatial objects: `sites` have names and locations, and are found within the context of geopolitical units. Within the API and in the package, the site itself does not have associated information about taxa, dataset types or ages. The site is the container into which we add that information. So, when we search for sites we can search by:
| Parameter | Description |
| --------- | ----------- |
| sitename | A valid site name (case insensitive) using `%` as a wildcard. |
| siteid | A unique numeric site id from the Neotoma Database |
| loc | A bounding box vector, geoJSON or WKT string. |
| altmin | Lower altitude bound for sites. |
| altmax | Upper altitude bound for site locations. |
| database | The constituent database from which the records are pulled. |
| datasettype | The kind of dataset (see `get_tables(datasettypes)`) |
| datasetid | Unique numeric dataset identifier in Neotoma |
| doi | A valid dataset DOI in Neotoma |
| gpid | A unique numeric identifier, or text string identifying a geopolitical unit in Neotoma |
| keywords | Unique sample keywords for records in Neotoma. |
| contacts | A name or numeric id for individuals associated with sites. |
| taxa | Unique numeric identifiers or taxon names associated with sites. |
All sites in Neotoma contain one or more datasets. It's worth noting that the results of these search parameters may be slightly unexpected. For example, searching for sites by sitename, latitude, or altitude will return all of the datasets for the particular site. Searching for terms such as datasettype, datasetid or taxa will return the site, but the only datasets returned will be those matching the dataset-specific search terms. We'll see this later.
#### Site names: `sitename="%Lago%"` {.tabset}
We may know exactly what site we're looking for ("Lago Grande di Monticchio"), or have an approximate guess for the site name (for example, we know it's something like "Lago Grande", or "Grande Lago Grande", but we're not sure how it was entered specifically), or we may want to search all sites that have a specific term, for example, *Lago*.
We use the general format: `get_sites(sitename="%Lago%")` for searching by name.
PostgreSQL (and the API) uses the percent sign as a wildcard. So `"%Lago%"` would pick up ["Lago Grande di Monticchio"](https://data.neotomadb.org/26607) for us (and picks up "Lago di Martignano" and "Lago Padule"). Note that the search query is also case insensitive, so you could simply write `"%LAGO%"`.
##### Code
```{r sitename, eval=FALSE}
lac_sites <- neotoma2::get_sites(sitename = "%Lago %")
plotLeaflet(lac_sites)
```
##### Result
```{r sitenamePlot, echo=FALSE}
lac_sites <- neotoma2::get_sites(sitename = "%Lago %")
plotLeaflet(lac_sites)
```
#### Location: `loc=c()` {.tabset}
The original `neotoma` package used a bounding box for locations, structured as a vector of latitude and longitude values: `c(xmin, ymin, xmax, ymax)`. The `neotoma2` R package supports this simple bounding box, but also more complex spatial objects, using the [`sf` package](https://r-spatial.github.io/sf/). Using the `sf` package allows us to more easily work with raster and polygon data in R, and to select sites from more complex spatial objects. The `loc` parameter works with the simple vector, [WKT](https://arthur-e.github.io/Wicket/sandbox-gmaps3.html) and [geoJSON](http://geojson.io/#map=2/20.0/0.0) objects, and native `sf` objects in R.
As an example of searching for sites using a location, we've created a rough representation of Italy as a polygon. To work with this spatial object in R we also transformed the `geoJSON` element to an object for the `sf` package. There are many other tools to work with spatial objects in R. Regardless of how you get the data into R, `neotoma2` works with almost all objects in the `sf` package.
```{r boundingBox}
geoJSON <- '{"coordinates":
[[
[8.22, 44.13],
[12.44, 41.72],
[15.86, 37.82],
[18.61, 39.99],
[12.20, 45.39],
[13.62, 45.86],
[13.45, 46.46],
[12.21, 47.11],
[10.41, 46.69],
[6.99, 45.97],
[6.87, 44.34],
[8.22, 44.13]
]],
"type":"Polygon"}'
italy_sf <- geojsonsf::geojson_sf(geoJSON)
# Note here we use the `all_data` flag to capture all the sites within the polygon.
# We're using `all_data` here because we know that the site information is relatively small
# for Italy. If we were working in a new area or with a new search we would limit the
# search size.
italy_sites <- neotoma2::get_sites(loc = italy_sf, all_data = TRUE)
```
You can always simply `plot()` the `sites` objects, but you will lose some of the geographic context. The `plotLeaflet()` function returns a `leaflet()` map, and allows you to further customize it, or add additional spatial data (like our original bounding polygon, `italy_sf`, which works directly with the R `leaflet` package):
##### Code
```{r plotL, eval=FALSE}
neotoma2::plotLeaflet(italy_sites) %>%
leaflet::addPolygons(map = .,
data = italy_sf,
color = "green")
```
##### Result
```{r plotLeaf, echo=FALSE}
neotoma2::plotLeaflet(italy_sites) %>%
leaflet::addPolygons(map = .,
data = italy_sf,
color = "green")
```
<!--chapter:end:chapters/RSiteSearch.md-->
# SQL Quickly
SQL (Structured Query Language) is a standard language for querying and modifying relational databases. There is an official standard for SQL, which means that statements such as `SELECT` and `WHERE` are common across most SQL implementations. While many properties are common, individual vendors do maintain proprietary formatting. Neotoma is implemented using PostgreSQL, which has [extensive documentation online](https://www.postgresql.org/docs/14/index.html). It is beyond the scope of this document to teach SQL; however, we have included a large number of SQL queries along with the documentation to provide templates for queries, and to help the user understand how to work directly with the database. These queries can be typed or copied and pasted into database tools such as [pgAdmin](https://www.pgadmin.org/).
For the purposes of database development, the Neotoma team uses the following SQL tools:
* [pgAdmin](https://www.pgadmin.org/)
* [VS Code](https://code.visualstudio.com/)
* [psql](https://www.postgresql.org/docs/current/app-psql.html)
* [SchemaSpy](https://schemaspy.org/)
Throughout the Neotoma Database Manual we will refer to tables using the format `schema.tablename`. The `ndb` namespace is the schema for most of the data tables within the Neotoma database, so you will see things like `ndb.taxa` and `ndb.sites` frequently. There are other schemas, including `apps`, `doi` and `public`, but the data tables themselves are kept in the `ndb` schema. You can see all available data tables in the `ndb` schema using the [Neotoma database schema website](https://open.neotomadb.org/dbschema).
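If you are working with a local copy of the database, you can also list the tables in the `ndb` schema directly. The following sketch uses the `db` connection created earlier and the standard PostgreSQL `information_schema`; the same query can be run unchanged in `psql` or pgAdmin.

```{r, listNdbTables, eval=FALSE}
# List the tables in the `ndb` schema using the standard information_schema views.
ndbTables <- DBI::dbGetQuery(db, "
  SELECT table_name
  FROM information_schema.tables
  WHERE table_schema = 'ndb'
  ORDER BY table_name;")
head(ndbTables)
```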
## SQL Example
The following SQL example lists the number of sites by the geopolitical unit, where the geopolitical unit is a country. It uses three tables, `ndb.sites`, which contains **site** information, `ndb.geopoliticalunits`, which lists all geopolitical units (countries, provinces, towns, etc.), and a `JOIN` table, which helps to link the `ndb.sites` table to the `ndb.geopoliticalunits` table.
By convention, SQL first defines the output, using the `SELECT` statement. The `FROM` block is used to explain how the information is brought together, and the `WHERE` (and subsequent blocks) are meant to help filter and aggregate data as generated within the `FROM` block.
Throughout this document we will use `AS` statements to help us write our SQL queries neatly, and we will use `INNER JOIN` calls with `ON` statements, rather than `NATURAL INNER JOIN` calls. Using `AS` helps us create an alias so that we don't have to write so much in our query. As much as possible we will use the same aliases for tables throughout this guide. For example, `FROM ndb.sites AS st` is commonly used, where `st` becomes the alias for the `ndb.sites` table, and stands in for it elsewhere in the query. Often you will see `gpu` used for `ndb.geopoliticalunits`, or `tx` for `ndb.taxa`. In Postgres SQL it is not necessary to use the term `AS` (we use it here to make the queries clearer for folks just learning how to use SQL), so it's possible to see things like `ndb.sites st` in place of `ndb.sites AS st`.
The `NATURAL INNER JOIN` assumes that any columns with common names are used in the `JOIN`. Because Neotoma enforces naming consistency (in general) between [primary and foreign keys](#table-keys), we ought to be able to perform `NATURAL` joins. However, a trigger on most Neotoma tables adds the columns `recdatecreated` and `recdatemodified` to almost all tables. For this reason we make it a practice within SQL queries to be explicit about the joining columns.
The following example shows the use of queries with and without `AS`:
### SQL Query
```{sql, sampleCall, echo=TRUE, connection=db, output.var="geopoliticalsort"}
SELECT
COUNT(*) AS sites,
gpu.geopoliticalname,
gpu.geopoliticalunit
FROM
ndb.geopoliticalunits AS gpu
INNER JOIN ndb.sitegeopolitical AS sgp ON sgp.geopoliticalid = gpu.geopoliticalid
INNER JOIN ndb.sites st ON st.siteid = sgp.siteid
WHERE
gpu.geopoliticalunit = 'country'
GROUP BY
gpu.geopoliticalid
HAVING COUNT(*) > 0
ORDER BY sites DESC;
```
```{r}
DT::datatable(geopoliticalsort)
```
## Table Keys {#table-keys}
Within tables there are often Keys. A Key may be a **Primary Key** (PK), which acts as a unique identifier for individual records within a table, or a **Foreign Key** (FK), which refers to a unique identifier in another table. Primary Keys and Foreign Keys are critical for joining tables in a SQL query.
In relational databases such as Neotoma you will often have "entity" tables, which represent explicit things (e.g., people, sites), and then tables that express the relationship between entities using foreign keys. So, for example, the table `ndb.sitegeopolitical` expresses the relationship between `ndb.sites` and `ndb.geopoliticalunits`. Both `ndb.sites` and `ndb.geopoliticalunits` represent entities; a country is a thing, with properties, as is a site. The `ndb.sitegeopolitical` table defines the relationship between these things. It is a table with two columns: one is an FK that refers back to the PK, `siteid`, in `ndb.sites`, and one is an FK for the primary key of `ndb.geopoliticalunits`.
Switching to the *SQL Query* tab in the above example will show you that we are using the PK for `ndb.geopoliticalunits`, `geopoliticalid`. It links the complete information in the `ndb.geopoliticalunits` table to the `ndb.sites` table through a `JOIN` table that has two columns: the `siteid`, which is the PK for `ndb.sites`, and the `geopoliticalid` column, which is the PK for `ndb.geopoliticalunits`. In this way, `siteid` is the PK in `ndb.sites` and an FK in `ndb.sitegeopolitical`.
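As a small illustration of how these keys are used, the sketch below follows the same chain of keys for a single site, again using the `db` connection. The site name is only a placeholder.

```{r, keyChainExample, eval=FALSE}
# Follow the PK -> FK chain: ndb.sites (siteid) -> ndb.sitegeopolitical
# (siteid, geopoliticalid) -> ndb.geopoliticalunits (geopoliticalid).
# The site name below is a placeholder for illustration.
DBI::dbGetQuery(db, "
  SELECT st.sitename, gpu.geopoliticalname, gpu.geopoliticalunit
  FROM ndb.sites AS st
  INNER JOIN ndb.sitegeopolitical AS sgp ON sgp.siteid = st.siteid
  INNER JOIN ndb.geopoliticalunits AS gpu ON gpu.geopoliticalid = sgp.geopoliticalid
  WHERE st.sitename ILIKE 'Marion Lake%';")
```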
## Data Types
Neotoma uses several standard SQL data types. You can find out more about these individual data types in the Postgres [data type documentation](https://www.postgresql.org/docs/current/datatype.html).
### Query {.tabset}
#### SQL Query
```{sql, dataTypeCall, echo=FALSE, connection = db, fig.cap="Common data types within the Neotoma database.", output.var="datatypes"}
SELECT LOWER(data_type) AS DataType, count(*) AS columns
FROM information_schema.columns
WHERE table_schema = 'ndb'
GROUP BY data_type
HAVING COUNT(*) > 1;
```
#### Data {.active}
```{r}
DT::datatable(datatypes)
```
<!--chapter:end:chapters/sql_quickly.Rmd-->
# Database Design Concepts
The Neotoma database itself is a highly normalized object that contains over 150 distinct tables, allowing us to uniquely and precisely describe individual occurrences of paleoecological elements in space and time, and ascribe to those observations information about the mode of preparation, collection and publication. An [interactive database schema](https://open.neotomadb.org/dbschema) is available through the Neotoma website.
![**Figure 1**. The Entity-Relationship Diagram for the Neotoma Database. This image is a scalable vector graphic and can be opened in a new window by right-clicking the image and selecting "Open image in new tab". Tables are represented by boxes, and the lines connecting individual tables represent primary key -> foreign key relationships.](assets/images/ERDFullNeotoma.svg)
## Sites, Collection Units, Analysis Units, Samples, and Datasets {#sitedesign}
Fossil data are **site** based. A [`site`](#Sites) has a name, latitude-longitude coordinates (with a projection), altitude, and areal extent. In Neotoma, *sites* are designated geographically as points or polygons, and subsequently converted into bounding boxes with north and south latitude coordinates and east and west longitude coordinates. Neotoma uses a Postgres database, and represents spatial data using [postgis's geography object](http://postgis.net/workshops/postgis-intro/geography.html). Most legacy sites in Neotoma are represented only using point coordinates.
![**Figure 2**. Three panels showing context for Neotoma's geographic representation of sites. In panel **a** a site is defined by the boundaries of a lake. The site also has a bounding box, and the core location is defined by a collection unit within the site that is defined with precise coordinates. In panel **b** a site is defined as a single point, for example, from a textual reference indicating the site is at the intersection of two roads. Here the site and collection unit share the unique point location. In panel **c** we show how that site location may be obfuscated using a bounding box as the site delimiter. In this case the collection unit would not be defined (but is represented as the triangle for illustration).](assets/images/siteboundarydiagram.svg)
The lat-long box can circumscribe the site, for example a lake (*Figure 2a*), or it may circumscribe a larger area in which the site lies, either because the exact location of the site is not known or because the exact location is purposely kept vague (*Figure 2c*). In the case of many legacy sites, the exact location is not known precisely; for example, it may have been described as *on a gravel bar 5 miles east of town*. The exact locations of some sites have purposely been kept vague to prevent looting and vandalism.
A [`Collection Unit`](#CollectionUnits) is a unit from a site from which a collection of fossils or other data have been made. Typical Collection Units are cores, sections, and excavation units. A site may have several Collection Units. A Collection Unit is located spatially within a site and may have precise GPS latitude-longitude coordinates. Its definition is quite flexible. For pollen data, a Collection Unit is typically a core, a section, or surface sample. A Collection Unit can also be a composite core comprised of two or more adjacent cores pieced together to form a continuous stratigraphic sequence. A Collection Unit can also be an excavation unit. For faunal data, a Collection Unit could be as precise as an excavation square, or it could be a group of squares from a particular feature within a site. For example, consider a pit cave with three sediment cones, each with several excavation squares. Collection Units could be defined as the individual squares, or as three composite Collection Units, one from each sediment cone. Another example is an archaeological site, from which the reported Collection Units are different structures, although each structure may have had several excavation squares. The precision in the database depends on how data were entered or reported.
For many published sites, the data are reported from composite Collection Units. If faunal data are reported from a site or locality without explicit Collection Units, then data are assigned to a single Collection Unit with the name *Locality*.
Different kinds of data may have been collected from a single Collection Unit, for example fauna and macrobotanicals from an excavation, or pollen and plant macrofossils from a lake-sediment core. A composite Collection Unit may include data from different milieus, which, nevertheless, are associated with each other, for example a diatom sample from surficial lake sediments and an associated lake-water sample for water-chemistry measurements.
The Collection Unit is equivalent to the *Entity* in the Global Pollen Database but was not defined in FAUNMAP. When the FAUNMAP data were imported into Neotoma, most localities were assigned a single *Locality* Collection Unit. However, for some localities, the data were assigned to different Collection Units that were clearly identifiable in FAUNMAP (see **Figure 3**).
![image1](assets/images/image5.png)
**Figure 3**. A diagram showing the relationships between tables in Neotoma, the Pollen Database, and FAUNMAP. Because the Global Pollen Database had only pollen, no need existed for the concept of Analysis Units, which may have multiple data types. FAUNMAP did not make a hierarchical distinction between Collection Units and Analysis Units, and the data for both Analysis Units and fauna are contained in the Faunal table, although within the Faunal table, implicit one-to-many relationships exist between Localities and Analysis Units and between Analysis Units and faunal data.
An [`Analysis Unit`](#AnalysisUnits) is a stratigraphic unit within a Collection Unit and is typically defined in the vertical dimension. An Analysis Unit may be a natural stratigraphic unit with perhaps irregular depth and thickness, or it may be an arbitrary unit defined by absolute depth and thickness. An excavation may have been dug in arbitrary units, for example 10 cm levels, or it may have followed natural stratigraphic boundaries, for example the *red zone* or a feature in an archaeological site. Although Analysis Units could be designated by an upper depth and lower depth, in Neotoma they are designated by their midpoint depth and thickness, which is more convenient for developing age models. Pollen and other microfossils are typically sampled at arbitrary depths, and although these samples have thicknesses corresponding to the thickness of the sampling device (usually 1 cm or less), these thicknesses are often not reported, just the depths. Different kinds of samples may have been taken from a single analysis unit, for example pollen, diatoms, and ostracodes. The Analysis Unit links these various samples together.
In larger excavations, natural stratigraphic Analysis Units may cut across excavation squares or Collection Units, and the data are reported by Analysis Unit rather than by Collection Unit. In this case, the fossil data are assigned to a generic composite Collection Unit named *Locality*, which has the explicitly defined Analysis Units. If the Analysis Units are not described or reported, then the data are assigned to a single Analysis Unit with the name *Assemblage*. Thus, for a locality published with only a faunal list, the fauna are assigned to a Collection Unit named *Locality* and to an Analysis Unit named *Assemblage*.
In FAUNMAP, Analysis Units are the primary sample units, and fauna are recorded by Analysis Unit. In the GPD, Analysis Units correspond to samples.
[`samples`](#Samples) are of a single data type from an Analysis Unit. For example, there may be a vertebrate faunal sample and a macrobotanical sample from the same Analysis Unit; or there may be a pollen sample and an ostracode sample from the same Analysis Unit. There can be multiple samples of the same data type from an Analysis Unit, for example two pollen samples counted by different analysts. Normally, vertebrate fossils from an Analysis Unit comprise a single sample; however, if the fossils are of mixed age, individually dated bones may be treated as separate samples, each with a precise age. In addition to fossils, samples may also be used for physical measurements, such as loss-on-ignition. Geochronologic measurements, such as radiocarbon dates, are made on geochronologic samples.
A [`dataset`](#Datasets) is a set of Samples of a single data type from a Collection Unit. For example the pollen data from a core comprise a pollen Dataset. The geochronologic samples from a Collection Unit form a geochronologic Dataset. Every Sample is assigned to a Dataset, and every Dataset is assigned to a Collection Unit. Samples from different Collection Units cannot be assigned to the same Dataset (although they may be assigned to Aggregate Datasets).
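The site, collection unit, dataset, and sample hierarchy described above can be traversed directly in SQL. The sketch below is a minimal example using the `db` connection; the join columns (`siteid`, `collectionunitid`, `datasetid`) follow the naming conventions used throughout this manual, but should be confirmed against the [schema browser](https://open.neotomadb.org/dbschema).

```{r, hierarchyExample, eval=FALSE}
# Count samples per dataset, walking site -> collection unit -> dataset -> sample.
# Table and column names are assumed from the naming conventions in this manual.
DBI::dbGetQuery(db, "
  SELECT st.sitename,
         cu.collectionunitid,
         ds.datasetid,
         COUNT(smp.sampleid) AS samples
  FROM ndb.sites AS st
  INNER JOIN ndb.collectionunits AS cu ON cu.siteid = st.siteid
  INNER JOIN ndb.datasets AS ds ON ds.collectionunitid = cu.collectionunitid
  INNER JOIN ndb.samples AS smp ON smp.datasetid = ds.datasetid
  GROUP BY st.sitename, cu.collectionunitid, ds.datasetid
  LIMIT 10;")
```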
## Taxa and Variables
In general, a single `sample` in Neotoma has an associated set of taxa, from the same dataset type, with some measure of abundance. The [`data`](#Data) table in Neotoma has fields for `sampleid`, `variableid`, and `value`. **Variables** -- listed in the [`variables`](#Variables) table -- consist of a **`taxon`**, referenced by the [`taxa`](#Taxa) table, as well as the identified **`element`**, measurement **`units`**, **`context`**, and **`modification`**. Because of the way Neotoma data are structured, non-biological elements are also contained within the `taxa` table; for example, there is an entry for *loss-on-ignition*. For this reason, the field `taxagroupid` identifies the "taxon" type, and the hierarchy (the column `highertaxonid`) helps the user understand the context of the taxon name. A worked query that resolves sample values through these tables follows the list below.
**Table 1.** The `taxa` table entry for *loss-on-ignition*, showing the `taxagroupid`. The `highertaxonid` value of 5783 points to the `loss-on-ignition` entry itself; it does not belong to a hierarchy.
| taxonid | taxoncode | taxonname | author | valid | highertaxonid | extinct | taxagroupid | publicationid | validatorid | validatedate | notes |
| ------- | --------- | ---------------- | ------ | ----- | ------------- | ------- | ----------- | ------------- | ----------- | ------------ | ----- |
| 5783 | LOI | loss-on-ignition | | t | 5783 | f | LOI | | 44 | 2013-02-18 | |
* For biological taxa, the **`element`** is the organ or skeletal element. Typical faunal elements are bones, teeth, scales, and other hard body parts. Bone and tooth elements may be specifically identified (e.g. *tibia* or even more precisely *tibia, distal, left*, *M2, lower, left*). Some soft elements also occur in the database (e.g. *hair* and *dung*). For mammals, an unspecified element is *bone/tooth*. Elements for plant macrofossils are the organs identified (e.g. *seed*, *needle*, *cone bract*). Pollen and spores are treated simply as taxon elements. Thus, *Picea* seeds, *Picea* needles, and *Picea* pollen are three different `variables`. All three refer to a single entry in the [`taxa`](#Taxa) table for *Picea*.
* Variable **Units** are the measurement units. For faunal data, the most common are *present/absent*, *number of individual specimens* (NISP), and *minimum number of individuals* (MNI). Plant macrofossils have many different quantitative and semi-quantitative measurement Units, including concentrations and relative abundance scales. Measurement Units for pollen are NISP (counts) and *percent*. For pollen the preferred measurement Unit is NISP, but for some sites only percentage data are available. *Picea* pollen NISP and *Picea* pollen percent are two different Variables.
* Variable **Contexts** for fauna include *articulated*, *intrusive*, and *redeposited*. A context for pollen is *anachronic*, which refers to a pollen type known to be too old for the contemporary sedimentary deposit. Most Variables do not have a specified context.
* Variable **Modifications** include various modifications to fossils or modifiers to Variables, including human modifications to bones (e.g. *bone tool*, *human butchering*, *burned*) and preservational and taphonomic modifications (e.g. *carnivore gnawed*, *fragment*). Modifications for pollen include preservational classifications such as *corroded* and *degraded*.
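The decomposition of a Variable into taxon, element, units, and context described above can be traversed with a single join across the `variables` table and its lookup tables. The sketch below assumes foreign-key and label columns named `taxonid`, `variableelementid`/`variableelement`, `variableunitsid`/`variableunits`, and `variablecontextid`/`variablecontext`; consult the table descriptions later in this manual for the definitive field names.

```{sql, variableresolution, connection=db}
-- Resolve counted values back to their taxon, element, units, and context.
-- The foreign-key and label column names used here are assumptions.
SELECT tx.taxonname,
       ve.variableelement,
       vu.variableunits,
       vc.variablecontext,
       dt.value
FROM ndb.data AS dt
INNER JOIN ndb.variables AS v ON v.variableid = dt.variableid
INNER JOIN ndb.taxa AS tx ON tx.taxonid = v.taxonid
LEFT JOIN ndb.variableelements AS ve ON ve.variableelementid = v.variableelementid
LEFT JOIN ndb.variableunits AS vu ON vu.variableunitsid = v.variableunitsid
LEFT JOIN ndb.variablecontexts AS vc ON vc.variablecontextid = v.variablecontextid
LIMIT 10;
```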
## Taxonomy and Synonymy
Neotoma does not change or question identifications from original sources, although taxonomic names may be synonymized to currently accepted names. Thus, for example, the old (although still valid) non-standard plant family names such as Gramineae and Compositae are synonymized to their standard family names terminated with *-aceae*, viz. Poaceae and Asteraceae. Neotoma has not attempted to establish complete or comprehensive synonymies. However, the [`synonyms`](#Synonyms) table lists commonly encountered synonyms. The descriptions of the [`synonymtypes`](#SynonymTypes) and [`taxa`](#Taxa) tables contain fuller discussions of the synonymizations made in Neotoma.
An important feature of Neotoma is that the [`taxa`](#Taxa) table is hierarchical. Each Taxon has a HigherTaxonID, which is the TaxonID of the next higher taxonomic rank. Thus, data are stored at the highest taxonomic resolution reported by the original investigators, but can be extracted at a higher taxonomic level.
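For example, the hierarchy can be walked upward with a recursive query using only the `taxonid`, `taxonname`, and `highertaxonid` fields shown in Table 1; a sketch (the taxon name used here is purely illustrative):

```{sql, taxahierarchywalk, connection=db}
-- Walk from a single taxon up through its parents via highertaxonid.
-- UNION (rather than UNION ALL) stops the recursion at rows that point
-- to themselves, as top-level entries such as loss-on-ignition do.
WITH RECURSIVE uptree AS (
    SELECT taxonid, taxonname, highertaxonid
    FROM ndb.taxa
    WHERE taxonname = 'Picea'
  UNION
    SELECT t.taxonid, t.taxonname, t.highertaxonid
    FROM ndb.taxa AS t
    INNER JOIN uptree AS u ON t.taxonid = u.highertaxonid
)
SELECT taxonid, taxonname, highertaxonid
FROM uptree;
```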
Synonymy presents a challenge for any organismal database, particularly for one such as Neotoma, which archives data collected over more than a century and which includes extinct taxa, often known only from a few fragmentary specimens. Many changes are due to increased understanding of the diversity within taxonomic groups and of the phylogenetic relationships within and among groups. Other changes are due purely to taxonomic rules or conventions set by the International Code of Botanical Nomenclature (McNeill et al. 2006) and the International Code of Zoological Nomenclature (International Commission on Zoological Nomenclature 1999). Working groups representing the different taxonomic groups included in Neotoma have established appropriate taxonomic authorities:
* Plants -- There is no worldwide authority. The International Plant Names Index[^1] lists validly published names, but a listed name is not necessarily the accepted name for a given taxon. For families, Neotoma follows the Angiosperm Phylogeny Group II (2003) and Stevens (2007+), which follows and updates APG II. The APG is an international consortium of plant taxonomists, and the APG classification utilizes the great quantity of phylogenetic data generated in recent years. For lower taxonomic ranks, the various pollen database cooperatives follow appropriate regional floras:
* North American Pollen Database/North American Plant Macrofossil Database: Insofar as possible, follows the *Flora of North America* (Flora of North America Editorial Committee 1993+); about half of the planned FNA volumes have been published. Otherwise, appropriate regional floras are followed.
* European Pollen Database: The EPD has a Taxonomy Support Group. In general, nomenclature follows *Flora Europaea* (Tutin 1964-1993).
* African Pollen Database: The APD has a Committee for Nomenclature, which has produced a list of pollen types with misspellings, synonymy, and nomenclature corrected[^2]. APD nomenclature follows *Enumération des plantes à fleurs d'Afrique Tropicale* (Lebrun and Stork 1991-1997).
* Latin American Pollen Database: Latin America has a tremendously rich and diverse flora, and no comprehensive flora is available. Various regional floras are followed.
* Indo-Pacific Pollen Database: For Australia and adjacent areas, follows the *Australian Plant Name Index* (Chapman 1991). For other regions, appropriate regional floras are followed.
* Pollen Database for Siberia and the Russian Far East: Follows *Vascular Plants of Russia and Adjacent States* (Czerepanov 1995).
* Mammals -- For extant taxa, the authority is Wilson and Reeder's (2005) *Mammal Species of the World*. Original sources are followed for extinct species, and the database is considered authoritative.
* Birds -- For North America, the authority is the American Ornithologists' Union *Check-list of North American Birds* (American Ornithologists' Union 1983).
* Fish -- Follows the *Catalog of Fishes* (Eschmeyer 1998).
* Mollusks -- For North America, follows *Common and Scientific Names of Aquatic Invertebrates from the United States and Canada: Mollusks* (Turgeon et al. 1998).
* Beetles -- Comprehensive manuals do not exist. Original taxonomic authorities are cited, and the database is considered authoritative.
## Taxa and Ecological Groups
In the [`taxa`](#Taxa) table, each taxon is assigned a TaxaGroupID, which refers to the [`taxagrouptypes`](#TaxaGroupTypes) table. These are major taxonomic groups, such as *Vascular plants*, *Diatoms*, *Testate amoebae*, *Mammals*, *Reptiles and amphibians*, *Fish*, and *Molluscs*. Also included are *Charcoal* and *Physical variables*.
**Ecological Groups** are groupings of taxa within Taxa Groups, which may be ecological or taxonomic. Ecological Groups are assigned in the [`ecolgroups`](#EcolGroups) table, in which taxa are assigned an `ecolgroupid`, which links to the [`ecolgrouptypes`](#EcolGroupTypes) table, and an `ecolsetid`, which links to the [`ecolsettypes`](#EcolSetTypes) table.
Ecological Groups are commonly used to organize taxa lists and stratigraphic diagrams. For any taxonomic group, more than one Ecological Set may be assigned. For example, beetles may be assigned to a set of ecological groups, such as dung and bark beetles, and to a second set based on taxonomy. Vascular plants are assigned to a *Default plant* set composed of groups such as *Trees and Shrubs*, *Upland Herbs*, and *Terrestrial Vascular Cryptogams*. Default pollen diagrams can then be generated based on a pollen sum of these three groups. Mammals are assigned to a *Vertebrate orders* set.
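A sketch of how these assignments might be tallied, assuming the `ecolgroups` table carries `taxonid`, `ecolgroupid`, and `ecolsetid` columns as described above:

```{sql, ecolgrouptally, connection=db}
-- Count the taxa assigned to each ecological group within each ecological set.
-- Column names on ndb.ecolgroups are assumed from the description above.
SELECT eg.ecolsetid,
       eg.ecolgroupid,
       COUNT(DISTINCT eg.taxonid) AS ntaxa
FROM ndb.ecolgroups AS eg
GROUP BY eg.ecolsetid, eg.ecolgroupid
ORDER BY eg.ecolsetid, ntaxa DESC
LIMIT 10;
```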
## Chronology
Neotoma stores both the archival data used to reconstruct chronologies and the interpreted chronologies derived from those data. The basic chronologic data occur in three tables:
* [`geochronology`](#Geochronology),
* [`tephrachronology`](#Tephrachronology), and
* [`relativechronology`](#RelativeChronology).
The [`geochronology`](#Geochronology) table includes geophysical measurements such as radiocarbon, thermoluminescence, uranium series, and potassium-argon dates. This table also includes dendrochronological dates derived from tree-ring chronologies, for example logs in archaeological structures. The [`tephrachronology`](#Tephrachronology) table records tephras in Analysis Units. This table refers to the [`tephras`](#Tephras) lookup table, which stores the ages for known tephras. The [`relativechronology`](#RelativeChronology) table stores relative age information for Analysis Units. Relative age scales include the archaeological time scale, geologic time scale, geomagnetic polarity time scale, marine isotope stages, North American land mammal ages, and Quaternary event classification. For example, diagnostic artifacts from an archaeological site may have cultural associations with known age ranges, which can be assigned to Analysis Units. The faunal assemblage from an Analysis Unit may be assignable to a particular land mammal age, which places it within a broad time range. Sedimentary units may be assigned to particular geomagnetic chrons, marine isotope stages, or Quaternary events, such as a particular interglacial. Many of these relative ages have rather broad time spans, but they do provide some chronologic control.
Actual Chronologies are constructed from the basic chronologic data in the [`geochronology`](#Geochronology), [`tephrachronology`](#Tephrachronology), and [`relativechronology`](#RelativeChronology) tables. These chronologies are stored in the [`chronologies`](#Chronologies) table. A Chronology applies to a Collection Unit and consists of a number of Chron Controls, which are ages assigned to Analysis Units. A Chron Control may be an actual geochronologic measurement, such as a radiocarbon date, or it may be derived from the actual measurement, such as a radiocarbon date adjusted for an old carbon reservoir or calibrated to calendar years. A Chron Control may be an average of several radiocarbon dates from the same Analysis Unit. Different kinds of basic chronologic data may be used to assign an age to an Analysis Unit, for example radiocarbon dates and diagnostic archaeological artifacts. Some relative Chron Controls are not from one of the established relative time scales. Examples of these are local biostratigraphic controls, which may be based on dated horizons from nearby sites. A familiar example is the *Ambrosia* rise, which marks European settlement. The exact date varies regionally, depending on when settlement occurred locally. For a given site, the date assigned to the *Ambrosia* rise may be based on historical information about when settlement occurred or possibly on geophysical dating (e.g. ^210^Pb) of a nearby site.
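As a sketch of how the controls behind a Chronology might be listed (assuming `chroncontrols` carries `chronologyid`, `chroncontroltypeid`, and `age` columns, and `chroncontroltypes` carries a `chroncontroltype` label; see the table descriptions later in this manual):

```{sql, chroncontrollisting, connection=db}
-- List the chronological controls attached to each chronology, youngest first.
-- Column names on ndb.chroncontrols and ndb.chroncontroltypes are assumed.
SELECT ch.chronologyid,
       cct.chroncontroltype,
       cc.age
FROM ndb.chronologies AS ch
INNER JOIN ndb.chroncontrols AS cc ON cc.chronologyid = ch.chronologyid
INNER JOIN ndb.chroncontroltypes AS cct
        ON cct.chroncontroltypeid = cc.chroncontroltypeid
ORDER BY ch.chronologyid, cc.age
LIMIT 10;
```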
![image2](assets/images/image6.png)
For continuous stratigraphic sequences, such as cores, not every Analysis Unit may have a direct date. Therefore, ages are commonly interpolated between dated Analysis Units. In this case, the ChronControls are the age-depth control points for an age model, which may be linear interpolation between Chron Controls or a fitted curve or spline.
**Figure 4. Smoothed quick radiocarbon calibration curve. At the scale of this figure, the difference between the smoothed curve and the original calibration curve is mostly less than the line thickness.**
Age is measured in different time scales, the two most common being radiocarbon years before present (^14^C yr BP) and presumed calendar years before present (cal yr BP). For a calibrated radiocarbon date, *cal yr BP* technically stands for *calibrated years before present*, i.e. calibrated to calendar years. In Neotoma, *cal yr BP* is used both for calibrated radiocarbon years and for other age scales presumed to be in calendar years, viz. dendrochronologic years and other geochronologic ages believed to be in calendar years. The zero datum for any *BP* age is ad 1950, regardless of its derivation. Thus, BP ages younger than ad 1950 are negative---ad 2000 = ‑50 BP.
![image3](assets/images/image7.png)
Ages may be reported in ad/bc age units, in which case bc years are stored as negative values. If ages are reported with a datum other than ad 1950 for BP years, the ages must be converted to an ad 1950 datum or to the ad/bc age scale before entry into Neotoma. For example, ^210^Pb dates are often reported relative to the year of analysis; these must be converted to either ad/bc or *cal yr BP* with an ad 1950 datum.
**Figure 5. An enlarged portion of Figure 4 showing the monotonic smoothed curve.**
Radiocarbon years can be calibrated to calendar years with a calibration curve. The current calibration curve for ≤26,000 cal yr BP (= 21,341 ^14^C yr BP) is the INTCAL04 calibration curve (Reimer et al. 2004). Various programs, both online and standalone, are available for calibrating individual radiocarbon dates; two of the more popular are CALIB[^3] (Stuiver and Reimer 1993) and OxCal[^4] (Bronk Ramsey 1995, 2001), both available online for download. Calibration of radiocarbon years beyond the INTCAL04 curve is more controversial. However, the Fairbanks0107 curve is available for calibration of radiocarbon dates to 50,000 cal yr BP, the practical limit of radiocarbon dating (Fairbanks et al. 2005, Chiu et al. 2007), with an online application[^5].
![image4](assets/images/image8.png)
**Figure 6**. Sample ages calculated from the Neotoma quick calibration curve vs. ages calculated from traditional age models.
Calibrated radiocarbon dates better represent the true time scale and the true errors and probability distributions of the age estimates. In addition, other important paleo records, notably the ice cores and tree-ring records, have calendar-year time-scales. Therefore, for comparison among proxies and records, it is clearly desirable to place all records on the same time-scale, viz. a calendar-year time-scale. Although this goal is laudable, most of the data ingested into Neotoma from other databases is on a radiocarbon time scale. The majority of assigned ages and almost all the ages from the pollen database are interpolated ages derived from age models. The proper method for deriving calibrated ages is to calibrate the radiocarbon dates and then reinterpolate new ages between these calibrated dates.
Virtually all age models are problematic. A key problem is that most age models linearly interpolate between age-depth points or fit functions or splines to points. However, radiocarbon ages are not points, but probability distributions. Moreover, the probability distributions of calibrated ages are non-Gaussian. Each calibrated age has a unique probability distribution, and many are bimodal or multimodal. Various investigators have used different points, including the intercepts of the radiocarbon age with the calibration curve and the midpoint of the 1σ or 2σ probability distribution. The former is particularly inappropriate (Telford et al. 2004b). The 50% median probability is probably the best single point; however, because of multimodality, this particular point may, in fact, be very unlikely. Nevertheless, if it falls between more-or-less equally probable modes, it may still be the best single point. Most age models for cores are based on relatively few radiocarbon dates, and the uncertainties of the interpolated ages are unknown and large (Telford et al. 2004a). Indeed, chronology is perhaps the greatest challenge for future research with this database.
![image5](assets/images/image9.png)
**Figure 7**. Anomalies (Sample ages from Neotoma default calendar-year age models minus ages calculated with the Neotoma quick calibration curve) vs. time.
Given the need for a common age scale and the enormity of the task of properly developing new age models, a [`radiocarboncalibration`](#RadiocarbonCalibration) conversion table was developed to quickly convert sample ages in radiocarbon years to calendar years. These calibrated ages are intended for perusal and data exploration; however, the differences between these ages and those calculated with traditional age models are relatively small. The table contains radiocarbon ages from -100 to 45,000 in 1-year increments with corresponding calibrated values. The table was generated by smoothing the INTCAL04 calibration curve with an FFT filter so that the curve is monotonically increasing, i.e. so that there are no age reversals in calibrated age. The INTCAL04 curve is in 5-yr increments from -5 to 12,500 ^14^C yr BP, 10-yr increments from 12,500 to 15,000 ^14^C yr BP, and 20-yr increments from 15,000 to 26,000 ^14^C yr BP. The FFT filter was 50 points (250 yr) for the first interval, 25 points (250 yr) for the second interval, and 10 points (200 yr) for the third interval. For the calibration beyond 26,000 ^14^C yr BP, a calibrated age was determined with the Fairbanks0107 calibration curve every 100 years, with a standard deviation of ±100 years, from 20,000±100 ^14^C yr BP to 46,700±100 ^14^C yr BP. These were then smoothed with a 5-sample (500-yr) FFT filter. The curve kinks sharply after 45,000 ^14^C yr BP, so the quick calibration curve was terminated at this date. The Fairbanks0107 curve diverges somewhat from the INTCAL04 curve over the portion where they overlap in age. From 20,000 to 26,000 ^14^C yr BP, the difference was prorated linearly from zero divergence from the INTCAL04 curve at 20,000 ^14^C yr BP to zero divergence from the Fairbanks0107 curve at 26,000 ^14^C yr BP. **Figure 4** shows the smoothed curve, and **Figure 5** shows an enlargement of part of the curve.
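A sketch of how the quick-calibration lookup might be used to convert a single interpolated sample age; the column names `c14yrbp` and `calyrbp` are placeholders (hence `eval=FALSE`), so consult the [`radiocarboncalibration`](#RadiocarbonCalibration) table description for the actual field names:

```{sql, quickcallookup, connection=db, eval=FALSE}
-- Look up the calendar-year equivalent of a 10,000 14C yr BP sample age.
-- The column names c14yrbp and calyrbp are placeholders, not the documented
-- field names of ndb.radiocarboncalibration.
SELECT c14yrbp, calyrbp
FROM ndb.radiocarboncalibration
WHERE c14yrbp = 10000;
```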
An analysis was made to assess the deviation between ages derived from traditionally calibrated age models and ages derived from the quick calibration curve. From the database, 57 default Chronologies in calibrated radiocarbon years were selected. The Chron Controls were all calibrated radiocarbon dates, except for top dates, European settlement dates, and ^210^Pb dates in the uppermost portions of the cores. A few Chronologies used the Zdanowicz et al. (1999) calendar-year age from the GISP2 ice core. Ages beyond the reliable age limit ([`chronologies.ageboundolder`](#Chronologies)) were not used. These 57 Chronologies had a total of 1945 Sample Ages in calibrated radiocarbon years. **Figure 6** shows a graph of ages from the Neotoma age models vs. the ages calculated with the quick calibration curve. **Figure 7** shows the anomalies vs. time, and **Figure 8** shows a histogram of the distribution of anomalies. Nearly half (47%) of the anomalies are \<25 years, 86% are \<100 years, 97% are \<200 years, and 99.4% are \<300 years. The average absolute anomaly is 49.2 years, and the median is 29 years. Thus, the quick calibration curve provides remarkably good results. The ages have no confidence limits, but neither do the interpolated ages of most age models.
![image6](assets/images/image10.png)
**Figure 8**. Binned distribution of anomalies between Neotoma default calendar-year age models and ages calculated with the Neotoma quick calibration curve.
## Sediment and Depositional Environments
Several tables deal with depositional environments, depositional agents, and sediment descriptions. In Neotoma, the **Depositional Environment** refers to the Depositional Environment of the site today, for example, *Fen*, *Cave*, *Colluvial Fan*. Depositional Environments may vary within a Site. For example, a lake with a marginal fen has lake and fen Depositional Environments. Thus, Depositional Environments are an attribute of Collection Units and are assigned in the [`collectionunits`](#CollectionUnits) table. Depositional Environments are listed in the [`depenvttypes`](#DepEnvtTypes) lookup table, and they are hierarchical, for example:
> Glacial Lacustrine
Any of these Depositional Environments may be assigned to a Collection Unit, but because they are hierarchical, Collection Units may be grouped at higher levels, for example, all Collection Units from natural lakes. The top level Depositional Environments, with some examples, are:
```{sql, topleveldepenv, connection=db}
SELECT depenvt AS "Depositional Environment"
FROM ndb.depenvttypes
WHERE depenvtid = depenvthigherid
LIMIT 10;
```
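Because the lookup table is self-referential, the full set of environments nested beneath any one Depositional Environment can be recovered with a recursive query; for example, everything under *Lacustrine* (assuming that name appears in the `depenvt` field):

```{sql, depenvhierarchy, connection=db}
-- Recover every depositional environment nested beneath 'Lacustrine'.
-- UNION (rather than UNION ALL) stops the recursion at self-referencing rows.
WITH RECURSIVE downtree AS (
    SELECT depenvtid, depenvt, depenvthigherid
    FROM ndb.depenvttypes
    WHERE depenvt = 'Lacustrine'
  UNION
    SELECT det.depenvtid, det.depenvt, det.depenvthigherid
    FROM ndb.depenvttypes AS det
    INNER JOIN downtree AS dt ON det.depenvthigherid = dt.depenvtid
)
SELECT depenvt AS "Depositional Environment"
FROM downtree;
```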
The Depositional Environment may change through time. For example, as a basin fills with sediment, it may convert from a lake to a fen and perhaps later to a bog. A colluvial slope may have alluvial sediments at depth. A modern playa lake may have a buried paleosol. Thus, a sediment section may have units with different facies and depositional agents. The **Facies** is the sum total of the characteristics that distinguish a sedimentary unit. Facies are listed in the [`faciestypes`](#FaciesTypes) lookup table and are assigned to Analysis Units in the [`analysisunits.faciesid`](#AnalysisUnits) field. A sedimentary unit may have one or more agents of deposition. For example, a cave deposit may be partly owing to human habitation and partly to carnivore activity. Depositional Agents are listed in the [`depagenttypes`](#DepAgentTypes) lookup table and are assigned to Analysis Units in the [`depagents`](#DepAgents) table.
Whereas Facies and Depositional Agents are both keyed to Analysis Units, the [`lithology`](#Lithology) table is keyed to Collection Units. Analysis Units, especially from cores, may not be contiguous but may instead be placed at discrete intervals down section. Lithologic units are defined by depth in the Collection Unit. Whereas Facies have short descriptions and are keyed to the [`faciestypes`](#FaciesTypes) lookup table, the [`lithology.description`](#Lithology) field is a memo, and lithologic descriptions are much more detailed than Facies descriptions. FAUNMAP, which was built around Analysis Units, stores Facies and Depositional Agent data; whereas the pollen database, which was centered on Collection Units, stores lithologic data.
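A sketch of how depositional agents might be tallied across Analysis Units, assuming the `depagents` table links an `analysisunitid` to a `depagentid` and the `depagenttypes` lookup carries a `depagent` label:

```{sql, depagenttally, connection=db}
-- Tally how often each depositional agent is recorded against analysis units.
-- Column names on ndb.depagents and ndb.depagenttypes are assumed.
SELECT dat.depagent,
       COUNT(*) AS nanalysisunits
FROM ndb.depagents AS da
INNER JOIN ndb.depagenttypes AS dat ON dat.depagentid = da.depagentid
GROUP BY dat.depagent
ORDER BY nanalysisunits DESC
LIMIT 10;
```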
## Date Fields
Neotoma uses date fields in several tables. Dates are stored internally as a double precision floating point number, which facilitates calculations and functions involving dates. The disadvantage is that complete dates must be stored, i.e. year, month, and day, whereas in many cases only the year or month is known, for example the month a core was collected. Neotoma has adopted the convention that if only the month is known, the day is set to the first of the month; if only the year is known, the month and day are set to January 1. Thus, *June 1984* is stored as *June 1, 1984*, and *1984* is stored as *January 1, 1984*. The drawback, of course, is that these imprecise dates cannot be distinguished from precise dates on the first of the month. However, it was determined that the advantages of the date fields outweighed this disadvantage.
[^1]: <http://www.ipni.org>
[^2]: <http://medias.obs-mip.fr/apd/>
[^3]: <http://calib.qub.ac.uk/calib/>
[^4]: <http://c14.arch.ox.ac.uk/embed.php?File=oxcal.html>
[^5]: <http://radiocarbon.ldeo.columbia.edu/research/radcarbcal.htm>
<!--chapter:end:chapters/db_design_concepts.Rmd-->
# Neotoma Tables
The Neotoma Database contains more than 150 tables, and, as new proxy types are added or new metadata is stored, the number of tables may increase. As a result, this manual should not be considered the final authority, but it should provide nearly complete coverage of the database and its structure. In particular, we do our best to divide tables into logical groupings: Chronology & Age related tables, Dataset related tables, Site related tables, Contact tables, Sample tables, and so on.
## Site Related Tables
Tables for key geographic information relating to the dataset: specifically, geographic coordinates, geopolitical units, and any situational information such as images of the site itself.
* [`geopoliticalunits`](#GeoPoliticalUnits)
* [`lakeparameters`](#lakeparameters)
* [`lakeparametertypes`](#lakeparametertypes)
* [`sitegeopolitical`](#SiteGeoPolitical)
* [`siteimages`](#SiteImages)
* [`sites`](#Sites)
## Dataset & Collection Related Tables
Tables related to complete datasets, or collections of samples. These include Collection information, but only refer to sites, since, as described in the Design Concepts, datasets are conceptually nested within sites, even if a site contains only a single dataset.
* [`accumulationrates`](#accumulationrates)
* [`aggregatedatasets`](#AggregateDatasets)
* [`aggregateordertypes`](#AggregateOrderTypes)
* [`collectiontypes`](#CollectionTypes)
* [`collectionunits`](#CollectionUnits)
* [`contextsdatasettypes`](#contextsdatasettypes)
* [`datasetdatabases`](#datasetdatabases)
* [`datasetdoi`](#datasetdoi)
* [`datasetpis`](#DatasetPIs)
* [`datasetpublications`](#DatasetPublications)
* [`datasets`](#Datasets)
* [`datasetsubmissions`](#DatasetSubmissions)
* [`datasetsubmissiontypes`](#DatasetSubmissionTypes)
* [`datasettaxagrouptypes`](#datasettaxagrouptypes)
* [`datasettaxonnotes`](#datasettaxonnotes)
* [`datasettypes`](#DatasetTypes)
* [`datasetvariables`](#datasetvariables)
* [`depenvttypes`](#DepEnvtTypes)
## Chronology & Age Related Tables
Information about the age models and chronological controls used to assess sample ages. Includes secondary information on tephras, and geochronological data types.
* [`agetypes`](#AgeTypes)
* [`aggregatechronologies`](#AggregateChronologies)
* [`aggregatesampleages`](#AggregateSampleAges)
* [`calibrationcurves`](#calibrationcurves)
* [`calibrationprograms`](#calibrationprograms)
* [`chroncontrolaccuracydirections`](#chroncontrolaccuracydirections)
* [`chroncontrolaccuracydistributions`](#chroncontrolaccuracydistributions)
* [`chroncontrolaccuracyranks`](#chroncontrolaccuracyranks)
* [`chroncontrolprecisionranks`](#chroncontrolprecisionranks)
* [`chroncontrolranks`](#chroncontrolranks)
* [`chroncontrols`](#ChronControls)
* [`chroncontrolscal14c`](#chroncontrolscal14c)
* [`chroncontroltypes`](#ChronControlTypes)
* [`chronologies`](#Chronologies)
* [`eventchronology`](#eventchronology)
* [`eventpublications`](#eventpublications)
* [`events`](#events)
* [`eventtypes`](#eventtypes)
* [`geochroncontrols`](#geochroncontrols)
* [`geochronology`](#Geochronology)
* [`geochronpublications`](#GeochronPublications)
* [`geochrontypes`](#GeochronTypes)
* [`radiocarbon`](#radiocarbon)
* [`radiocarboncalibration`](#RadiocarbonCalibration)
* [`radiocarbonmethods`](#radiocarbonmethods)
* [`relativeagepublications`](#RelativeAgePublications)
* [`relativeages`](#RelativeAges)
* [`relativeagescales`](#RelativeAgeScales)
* [`relativeageunits`](#RelativeAgeUnits)
* [`relativechronology`](#RelativeChronology)
* [`tephras`](#Tephras)
## Sample Related Tables
Information relating to individual samples or analysis units. This includes the age of the sample, the data content of the sample, and information relating to the physical condition or situation of the samples themselves.
* [`aggregatesamples`](#AggregateSamples)
* [`analysisunitaltdepths`](#analysisunitaltdepths)
* [`analysisunitaltdepthscales`](#analysisunitaltdepthscales)
* [`analysisunitlithostrat`](#analysisunitlithostrat)
* [`analysisunits`](#AnalysisUnits)
* [`data`](#Data)
* [`datataxonnotes`](#datataxonnotes)
* [`depagents`](#DepAgents)
* [`depagenttypes`](#DepAgentTypes)
* [`elementdatasettaxagroups`](#elementdatasettaxagroups)
* [`elementmaturities`](#elementmaturities)
* [`elementportions`](#elementportions)
* [`elementsymmetries`](#elementsymmetries)
* [`elementtaxagroupmaturities`](#elementtaxagroupmaturities)
* [`elementtaxagroupportions`](#elementtaxagroupportions)
* [`elementtaxagroups`](#elementtaxagroups)
* [`elementtaxagroupsymmetries`](#elementtaxagroupsymmetries)
* [`elementtypes`](#elementtypes)
* [`faciestypes`](#FaciesTypes)
* [`keywords`](#Keywords)
* [`lithology`](#Lithology)
* [`lithostrat`](#lithostrat)
* [`rocktypes`](#rocktypes)
* [`sampleages`](#SampleAges)
* [`sampleanalysts`](#SampleAnalysts)
* [`samplekeywords`](#SampleKeywords)
* [`samples`](#Samples)
* [`summarydatataphonomy`](#summarydatataphonomy)
## Specimen Related Tables
* [`fractiondated`](#fractiondated)
* [`repositoryinstitutions`](#repositoryinstitutions)
* [`repositoryspecimens`](#repositoryspecimens)
* [`specimendates`](#specimendates)
* [`specimendatescal`](#specimendatescal)
* [`specimendomesticstatustypes`](#specimendomesticstatustypes)
* [`specimengenbank`](#specimengenbank)
* [`specimens`](#specimens)
* [`specimensextypes`](#specimensextypes)
* [`specimentaphonomy`](#specimentaphonomy)
* [`taphonomicsystems`](#taphonomicsystems)
* [`taphonomicsystemsdatasettypes`](#taphonomicsystemsdatasettypes)
* [`taphonomictypes`](#taphonomictypes)
## Taxonomy Related Tables
Tables related to taxonomic information, phylogenetic information and ecological classifications. These tables also include hierarchy based on morphological or phylogenetic relationships.
* [`ecolgroups`](#EcolGroups)
* [`ecolgrouptypes`](#EcolGroupTypes)
* [`ecolsettypes`](#EcolSetTypes)
* [`externaltaxa`](#externaltaxa)
* [`formtaxa`](#formtaxa)
* [`synonyms`](#Synonyms)
* [`synonymtypes`](#SynonymTypes)
* [`synonymy`](#synonymy)
* [`taxa`](#Taxa)
* [`taxaalthierarchy`](#taxaalthierarchy)
* [`taxaalthierarchytypes`](#taxaalthierarchytypes)
* [`taxagrouptypes`](#TaxaGroupTypes)
* [`taxonpaths`](#taxonpaths)
* [`variablecontexts`](#VariableContexts)
* [`variableelements`](#VariableElements)
* [`variables`](#Variables)
* [`variableunits`](#VariableUnits)
## Individual Related Tables
Tables associated with individuals, institutions and organizations.
* [`collectors`](#Collectors)
* [`contacts`](#Contacts)
* [`contactstatuses`](#ContactStatuses)
* [`dataprocessors`](#dataprocessors)
## Publication Related Tables
Information relating to the publication of primary or derived data within the Neotoma Paleoecological Database.
* [`externalpublications`](#externalpublications)
* [`publicationauthors`](#PublicationAuthors)
* [`publicationeditors`](#PublicationEditors)
* [`publications`](#Publications)
* [`publicationtranslators`](#publicationtranslators)
* [`publicationtypes`](#PublicationTypes)
## Supporting Resources
* [`constituentdatabases`](#constituentdatabases)
* [`embargo`](#embargo)
* [`externaldatabases`](#externaldatabases)
## Isotope Data Management
* [`isobiomarkerbandtypes`](#isobiomarkerbandtypes)
* [`isobiomarkertypes`](#isobiomarkertypes)
* [`isoinstrumentation`](#isoinstrumentation)
* [`isoinstrumentationtypes`](#isoinstrumentationtypes)
* [`isomatanalsubstrate`](#isomatanalsubstrate)
* [`isomaterialanalyzedtypes`](#isomaterialanalyzedtypes)
* [`isometadata`](#isometadata)
* [`isopretreatmenttypes`](#isopretreatmenttypes)
* [`isosampleintrosystemtypes`](#isosampleintrosystemtypes)
* [`isosampleorigintypes`](#isosampleorigintypes)
* [`isosamplepretreatments`](#isosamplepretreatments)
* [`isoscaletypes`](#isoscaletypes)
* [`isospecimendata`](#isospecimendata)
* [`isosrmetadata`](#isosrmetadata)
* [`isostandards`](#isostandards)
* [`isostandardtypes`](#isostandardtypes)
* [`isostratdata`](#isostratdata)
* [`isosubstratetypes`](#isosubstratetypes)
* [`isovariablescaletypes`](#isovariablescaletypes)
<!--chapter:end:chapters/tables/neotoma_tables.Rmd-->
# Contact and Individual Related Tables
## `collectors` {#Collectors}
The `collectors` table lists the people who were responsible for collecting a particular collection unit. This is part of the complete set of individuals who collect, analyze, publish, and process the data contributed to Neotoma. Other related tables include [`publicationauthors`](#PublicationAuthors), [`sampleanalysts`](#SampleAnalysts), and [`datasetsubmissions`](#DatasetSubmissions).
```{r, collectorstable, echo=FALSE}
showTable(db, 'collectors')
```
* **`collectorid` (primary key)**: An arbitrary Collector identification number.
* **`collectionunitid` (foreign key)**: The `collectionunitid` for the record that was collected. Field links to the [`collectionunits`](#CollectionUnits) table.
* **`contactid` (foreign key)**: Person who collected the CollectionUnit. Multiple individuals are listed in separate records. Field links to the [`contacts`](#Contacts) table.
* **`collectororder`**: Order in which the collectors should be listed.
### SQL Example
We want to see the top contributor for each constituent database, so we count the number of collection units attributed to each contributor within each database and return that `COUNT(*)`. The `DISTINCT ON (db.databasename)` pattern keeps a single row per `db.databasename`, ordered by the descending `COUNT(*)` for each individual and database. This ensures we retain the top `collectionunit` collector for each constituent database.
```{sql, topcontribdb, connection=db}
SELECT DISTINCT ON (db.databasename)
ct.contactname,
db.databasename,
COUNT(*)
FROM ndb.contacts AS ct
INNER JOIN ndb.collectors AS co ON co.contactid = ct.contactid
INNER JOIN ndb.collectionunits AS cu ON cu.collectionunitid = co.collectionunitid
INNER JOIN ndb.datasets AS ds ON ds.collectionunitid = cu.collectionunitid
INNER JOIN ndb.datasetdatabases AS dsdb ON dsdb.datasetid = ds.datasetid
INNER JOIN ndb.constituentdatabases AS db ON db.databaseid = dsdb.databaseid
GROUP BY db.databasename, ct.contactname
ORDER BY db.databasename, COUNT(*) DESC;
```
## `contacts` {#Contacts}
This table lists persons and organizations. The table is referenced through Foreign Keys in the following tables:
* [`Chronologies`](#Chronologies),
* [`Collectors`](#Collectors),
* [`DatasetPIs`](#DatasetPIs),
* [`DatasetSubmissions`](#DatasetSubmissions),
* [`Projects`](#Projects),
* [`PublicationAuthors`](#PublicationAuthors),
* [`SampleAnalysts`](#SampleAnalysts), and
* [`SiteImages`](#SiteImages).
```{r, contactstable, echo=FALSE}
showTable(db, 'contacts')
```
* **`contactid (primary key)`**: An arbitrary `contactid` number.
* **`aliasid (foreign key)`**: The `contactid` of a person's current name. If the `aliasid` is different from the `contactid`, the `contactid` refers to the person's former name. For example, if J. L. Bouvier became J. B. Kennedy, the `contactid` for J. B. Kennedy is the `aliasid` for J. L. Bouvier.
* **`contactname`**: Full name of the person, last name first (e.g. *Simpson, George Gaylord*) or name of organization or project (e.g. *Great Plains Flora Association*).
* **`contactstatusid (foreign key)`**: Current status of the person, organization, or project. Field links to the [`contactstatuses`](#ContactStatuses) lookup table.
* **`familyname`**: Family or surname name of a person.
* **`leadinginitials`**: Leading initials for given or forenames **without** spaces (e.g. *G.G.*).
* **`givennames`**: Given or forenames of a person (e.g. *George Gaylord*). Initials with spaces are used if full given names are not known (e.g. *G. G.*).
* **`suffix`**: Suffix of a person's name (e.g. «Jr.», «III»).
* **`title`**: A person's title (e.g. «Dr.», «Prof.», «»).
* **`phone`**: Telephone number.
* **`fax`**: Fax number.
* **`email`**: Email address.
* **`url`**: Universal Resource Locator, an Internet World Wide Web address.
* **`address`**: Full mailing address.
* **`notes`**: Free form notes or comments about the person, organization, or project.
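The `aliasid` field described above can be resolved with a simple self-join on the `contacts` table; a minimal sketch using only the fields documented in this section:

```{sql, contactaliases, connection=db}
-- List contacts whose records point to a different, current name via aliasid.
SELECT older.contactname AS former_name,
       newer.contactname AS current_name
FROM ndb.contacts AS older
INNER JOIN ndb.contacts AS newer ON newer.contactid = older.aliasid
WHERE older.aliasid <> older.contactid
LIMIT 10;
```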
## `contactstatuses` {#ContactStatuses}
Lookup table of Contact Statuses. Table is referenced by the [`contacts`](#Contacts) table.
```{r, contactstatusestable, echo=FALSE}
showTable(db, 'contactstatuses')
```
* **`contactstatusid (primary key)`**: An arbitrary Contact Status identification number.
* **`contactstatus`**: Status of person, organization, or project.
* **`statusdescription`**: Description of the status. The following statuses exist (with descriptions):
* **active**: Person, project, or organization is active in the field
* **deceased**: Person is deceased
* **defunct**: Project or organization is defunct or non-operational
* **extant**: Project or organization is extant
* **inactive**: Person is inactive in the field
* **retired**: Person is retired
* **unknown**: Status is unknown
<!--chapter:end:chapters/tables/tables_contacts.Rmd-->
# Dataset & Collection Related Tables
## `accumulationrates` {#accumulationrates}
This table stores estimated accumulation rates based on particular chronologies associated with collection units. Units are defined within the table. Accumulation rates play an important role in understanding surficial and depositional processes [@Bennett2016], and can be critical for defining relevant priors for Bayesian chronologies [@Goring2012]. While accumulation rates can be calculated on-the-fly, Neotoma also stores accumulation rates when they are reported; however, at present only 11 chronologies have reported accumulation rates.
```{r, accumulationtable, echo=FALSE}
showTable(db, 'accumulationrates')
```
* **`analysisunitid` (foreign key)**: The identifier for the associated analysis unit.
* **`chronologyid` (foreign key)**: The chronology on which the accumulation rate estimate is based. Field links to the [`chronologies`](#Chronologies) table.
* **`accumulationrate`**: The rate at which sediment accumulates within a particular analysis unit, defined (generally) as the length of time required to accumulate some depth of sediment (*e.g.*, *yr/cm*).
* **`variableunitsid` (foreign key)**: Foreign key referencing variable units, describing the units in which the accumulation rate is reported.
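A sketch of how reported accumulation rates might be listed with their units, using the foreign keys above (the name of the units label field on `variableunits` is assumed here):

```{sql, accratesunits, connection=db}
-- List reported accumulation rates alongside their units.
-- The label column on ndb.variableunits is assumed to be named variableunits.
SELECT ar.chronologyid,
       ar.analysisunitid,
       ar.accumulationrate,
       vu.variableunits
FROM ndb.accumulationrates AS ar
INNER JOIN ndb.variableunits AS vu ON vu.variableunitsid = ar.variableunitsid
ORDER BY ar.chronologyid, ar.analysisunitid;
```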
## `aggregatedatasets` {#AggregateDatasets}
Aggregate Datasets are aggregates of samples of a particular [`datasettype`](#datasettypes).
Some examples:
* Plant macrofossil samples from a group of packrat middens collected from a particular valley, mountain range, or other similarly defined geographic area. Each midden is from a different `site` or `collectionunit`, but they are grouped into time series for that area and are published as a single dataset.
* Samples collected from 32 cutbanks along several km of road. Each sample is from a different site, but they form a time series from 0 -- 12,510 ^14^C yr BP, and pollen, plant macrofossils, and beetles were published and graphed as if from a single site.
* A set of pollen surface samples from a particular region or study that were published and analyzed as a single dataset and submitted to the database as a single dataset.
The examples above are datasets predefined in the database. New aggregate datasets could be assembled for particular studies, for example all the pollen samples for a given time slice for a given geographic region.
```{r, AggregateDatasetstable, echo=FALSE}
showTable(db, 'aggregatedatasets')
```
* **`aggregatedatasetid` (primary key)**: An arbitrary Aggregate Dataset identification number.
* **`aggregatedatasetname`**: Name of Aggregate Dataset.
* **`aggregateordertypeid` (foreign key)**: Aggregate Order Type identification number. Field links to the [`aggregateordertypes`](#AggregateOrderTypes) lookup table.
* **`notes`**: Free form notes or comments about the Aggregate Dataset.
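The fields above can be joined directly to the [`aggregateordertypes`](#AggregateOrderTypes) lookup table described next; a minimal sketch:

```{sql, aggordertypes, connection=db}
-- List aggregate datasets alongside the rule used to order their samples.
SELECT ad.aggregatedatasetname,
       aot.aggregateordertype
FROM ndb.aggregatedatasets AS ad
INNER JOIN ndb.aggregateordertypes AS aot
        ON aot.aggregateordertypeid = ad.aggregateordertypeid
ORDER BY ad.aggregatedatasetname
LIMIT 10;
```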
## `aggregateordertypes` {#AggregateOrderTypes}
Lookup table for Aggregate Order Types. Table is referenced by the
[`aggregatedatasets`](#AggregateDatasets) table.
```{r, aggregateordertypestable, echo=FALSE}
showTable(db, 'aggregateordertypes')
```
* **`aggregateordertypeid` (primary key)**: An arbitrary Aggregate Order Type identification number.
* **`aggregateordertype`**: The Aggregate Order Type.
* **`notes`**: Free form notes or comments about the Aggregate Order Type. The Aggregate Order Types are:
* **Latitude**: AggregateDataset samples are ordered by, in order of priority, either (1) [`collectionunits`](#CollectionUnits).GPSLatitude or (2) the mean of [`sites`](#Sites).LatitudeNorth and [`sites`](#Sites).LatitudeSouth.
* **Longitude**: AggregateDataset samples are ordered by, in order of priority, either (1) [`collectionunits`](#CollectionUnits).GPSLongitude or (2) the mean of [`sites`](#Sites).LongitudeWest and [`sites`](#Sites).LongitudeEast.
* **Altitude**: AggregateDataset samples are ordered by [`sites`](#Sites).Altitude.
* **Age**: AggregateDataset samples are ordered by [`sampleages`](#SampleAges).Age, where [`sampleages`](#SampleAges).SampleAgeID is from [`aggregatesampleages`](#AggregateSampleAges).SampleAgeID.
* **Alphabetical by site name**: AggregateDataset samples are ordered alphabetically by [`sites`](#Sites).SiteName.
* **Alphabetical by collection unit name**: AggregateDataset samples are ordered alphabetically by [`collectionunits`](#CollectionUnits).CollUnitName.
* **Alphabetical by collection units handle**: AggregateDataset samples are ordered alphabetically by [`collectionunits`](#CollectionUnits).Handle.
## `collectiontypes` {#CollectionTypes}
This table is a lookup table for types of Collection Units, or Collection Types. Table is referenced by the [`collectionunits`](#CollectionUnits) table.
```{r, CollectionTypestable, echo=FALSE}
showTable(db, 'collectiontypes')
```
* **`colltypeid` (primary key)**: An arbitrary Collection Type identification number.
* **`colltype`**: The Collection Type. Types include cores, sections, excavations, and animal middens. Collection Units may be modern collections, surface float, or isolated specimens. Composite Collection Units include different kinds of Analysis Units, for example a modern surface sample for ostracodes and an associated water sample.
## `collectionunits` {#CollectionUnits}
This table stores data for Collection Units.
```{r, CollectionUnitstable, echo=FALSE}
showTable(db, 'collectionunits')
```
* **`collectionunitid` (primary key)**: An arbitrary Collection Unit identification number.
* **`siteid` (foreign key)**: Site where CollectionUnit was located. Field links to [`sites`](#Sites) table.
* **`colltypeid` (foreign key)**: Type of Collection Unit. Field links to the [`collectiontypes`](#CollectionTypes) table.
* **`depenvtid` (foreign key)**: Depositional environment of the CollectionUnit. Normally, this key refers to the modern environment. For example, the site may be located on a colluvial slope, in which case the Depositional Environment may be Colluvium or Colluvial Fan. However, an excavation may extend into alluvial sediments, which represent a different depositional environment. These are accounted for by the Facies of the AnalysisUnit. Field links to the [`depenvttypes`](#DepEnvtTypes) table.
* **`handle`**: Code name for the Collection Unit. This code may be up to 10 characters, but an effort is made to keep these to 8 characters or less. Data are frequently distributed by Collection Unit, and the Handle is used for file names.
* **`collunitname`**: Name of the Collection Unit. Examples: Core BPT82A, Structure 9, P4A Test 57. If faunal data are reported from a site or locality without explicit Collection Units, then data are assigned to a single Collection Unit with the name «Locality».
* **`colldate`**: Date Collection Unit was collected.
* **`colldevice`**: Device used to obtain the Collection Unit. This field applies primarily to cores, for example «Wright square-rod piston corer (5 cm)».
* **`gpslatitude`**: Precise latitude of the Collection Unit, typically taken with a GPS, although may be precisely measured from a map.
* **`gpslongitude`**: Precise longitude of the Collection Unit, typically taken with a GPS, although may be precisely measured from a map.
* **`gpsaltitude`**: Precise altitude of the Collection Unit, typically taken with a GPS or precisely obtained from a map.
* **`gpserror`**: Error in the horizontal GPS coordinates, if known.
* **`waterdepth`**: Depth of water at the Collection Unit location. This field applies mainly to Collection Units from lakes.
* **`substrateid` (foreign key)**: Substrate or rock type on which the Collection Unit lies. Field links to the [`rocktypes`](#rocktypes) table. This field is especially used for rodent middens.
* **`slopeaspect`**: For Collection Units on slopes, the horizontal direction to which a slope faces measured in degrees clockwise from north. This field is especially used for rodent middens.
* **`slopeangle`**: For Collection Units on slopes, the angle of slope from horizontal. The `slopeangle` field is especially used for rodent middens.
* **`location`**: Short description of the location of the Collection Unit within the site.
* **`notes`**: Free form notes or comments about the Collection Unit.
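A sketch tying a Collection Unit to its parent site and its modern depositional environment, using only fields documented in this manual:

```{sql, collunitcontext, connection=db}
-- Show each collection unit with its parent site and its modern
-- depositional environment.
SELECT st.sitename,
       cu.handle,
       cu.collunitname,
       det.depenvt,
       cu.colldate
FROM ndb.collectionunits AS cu
INNER JOIN ndb.sites AS st ON st.siteid = cu.siteid
LEFT JOIN ndb.depenvttypes AS det ON det.depenvtid = cu.depenvtid
ORDER BY st.sitename
LIMIT 10;
```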
## `contextsdatasettypes` {#contextsdatasettypes}
Records the set of variable contexts associated with each dataset type. This table relates dataset types to variable contexts so that users are given a custom set of terms when entering and uploading data for a particular dataset type. For example, *Neotoma Pollen:broken* is not available for geochronologic datasets.
```{r, contextsdatasettypestable, echo=FALSE}
showTable(db, 'contextsdatasettypes')
```
* **`datasettypeid` (foreign key)**: Link to the datasettype for a particular variable.
* **`variablecontextid` (foreign key)**: Link to the variable context type.
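A sketch of how the available contexts might be counted per dataset type, assuming the `datasettypes` lookup table carries a `datasettype` label field:

```{sql, contextsbytype, connection=db}
-- Count how many variable contexts are registered for each dataset type.
-- The label column on ndb.datasettypes is assumed to be named datasettype.
SELECT dst.datasettype,
       COUNT(cdt.variablecontextid) AS ncontexts
FROM ndb.contextsdatasettypes AS cdt
INNER JOIN ndb.datasettypes AS dst ON dst.datasettypeid = cdt.datasettypeid
GROUP BY dst.datasettype
ORDER BY ncontexts DESC
LIMIT 10;
```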
## `datasetdatabases` {#datasetdatabases}
The constituent database to which a dataset belongs.
## `datasetdoi` {#datasetdoi}
The doi for a dataset.
## `datasetpis` {#DatasetPIs}
This table lists the Principal Investigators for Datasets.
```{r, DatasetPIstable, echo=FALSE}
showTable(db, 'datasetpis')
```