case study without large files
petrifiedvoices committed Jun 12, 2024
1 parent 2cb5e40 commit df6f7dd
Showing 13 changed files with 1,883 additions and 39 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -4,6 +4,10 @@ index.html?.*
already_mapped_data/
output_maps/
*.zip
*.RData
*.Rhistory



.ipynb_checkpoints/
.pytest_cache/
@@ -130,3 +134,4 @@ dmypy.json

# Pyre type checker
.pyre/
.Rproj.user
93 changes: 57 additions & 36 deletions EpigraphyScraper.ipynb
@@ -4,58 +4,31 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Welcome to Latin Epigraphy Scraper (*LatEpig*) v2.0\n",
"# Welcome to *LatEpig* v2.0\n",
"\n",
"*The Jupyter Notebook interface for the **LatEpig** tool allows you to query all the inscriptions from the Epigraphic Database Clauss Slaby (www.manfredclauss.de) in a reproducible manner: it saves the search results in a TSV file and plots them on an interactive map of the Roman Empire without any prior knowledge of programming in a matter of minutes.* \n",
"\n",
"This programme extracts the output of a search query from the [Epigraphik-Datenbank Clauss / Slaby (EDCS)](http://www.manfredclauss.de/) in a reproducible manner and saves it as a TSV file (i.e., *tab-separated values*) that can be easily opened in your favourite spreadsheet software, or as a JSON file. The search results can also be plotted on a map of the Roman Empire, along with the system of Roman provinces, roads, and cities. More on the datasets used in the *Data Sources* section. \n",
"\n",
"---\n",
"\n",
"## Authors \n",
"* [Brian Ballsun-Stanton, Macquarie University, ![](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0000-0003-4932-7912)\n",
"* [Petra Heřmánková, Aarhus University, ![](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0000-0002-6349-0540)\n",
"* [Ray Laurence, Macquarie University, ![](https://orcid.org/sites/default/files/images/orcid_16x16.png)](https://orcid.org/0000-0002-8229-1053)\n",
"\n",
"## Cite us in your research\n",
"\n",
"**Ballsun-Stanton B., Heřmánková P., Laurence R.** Lat Epig (v2.0). GitHub. URL: https://github.com/mqAncientHistory/Lat-Epig/ DOI: 10.5281/zenodo.5211341\n",
">**Ballsun-Stanton B., Heřmánková P., Laurence R.** Lat Epig (v2.0). GitHub. URL: https://github.com/mqAncientHistory/Lat-Epig/ DOI: 10.5281/zenodo.5211341\n",
"\n",
"**License**: [GNU General Public License v3.0](https://github.com/mqAncientHistory/Lat-Epig/blob/main/LICENSE)\n",
" \n",
"If you're using this tool in your research, <!-- Place this tag where you want the button to render. -->\n",
"<a class=\"github-button\" href=\"https://github.com/mqAncientHistory/Lat-Epig\" data-color-scheme=\"no-preference: light; light: light; dark: dark;\" data-icon=\"octicon-star\" data-show-count=\"true\" aria-label=\"Star mqAncientHistory/Lat-Epig on GitHub\">Star</a> us on Github! (This way, we don't need to put tracking pixels into this notebook to get a sense of how many folks are using our tool!) \n",
"\n",
"If you find a bug or have a feature request, raise an <!-- Place this tag where you want the button to render. -->\n",
"<a class=\"github-button\" href=\"https://github.com/mqAncientHistory/Lat-Epig/issues\" data-color-scheme=\"no-preference: light; light: light; dark: dark;\" data-icon=\"octicon-issue-opened\" data-show-count=\"true\" aria-label=\"Issue mqAncientHistory/Lat-Epig on GitHub\">Issue</a>!\n",
"---\n",
"\n",
"## Data sources\n",
"\n",
"### Inscriptions\n",
"The Epigraphik-Datenbank Clauss / Slaby (EDCS) is a digital collection of more than 500,000 Latin inscriptions, created by Prof. Manfred Clauss, Anne Kolb, Wolfgang A. Slaby, Barbara Woitas, and hosted by the Universität Zürich and Katholische Universität Eichstätt-Ingolstadt. For more see www.manfredclauss.de\n",
"\n",
"### Interactive Map\n",
"_1. Roman Empire Boundaries & Provinces_\n",
"\n",
"[Ancient World Mapping Centre, political shading shapefiles](http://awmc.unc.edu/awmc/map_data/shapefiles/cultural_data/political_shading/), following the Barrington Atlas of the Greek and Roman World, AWMC.UNC.EDU, under the Creative Commons Attribution-NonCommercial 4.0 International License.\n",
"\n",
"* Roman Empire 60 BC (provinces or extent)\n",
"* Roman Empire in AD 14 (provinces or extent)\n",
"* Roman Empire in AD 69 (provinces or extent)\n",
"* Roman Empire in AD 117 (DEFAULT, provinces or extent)\n",
"* Roman Empire in AD 200 (provinces or extent)\n",
"\n",
"_2. Roman Roads_\n",
"* McCormick, M. et al. 2013. \"Roman Road Network (version 2008),\" DARMC Scholarly Data Series, Data Contribution Series #2013-5. DARMC, Center for Geographic Analysis, Harvard University, Cambridge MA 02138.\n",
"\n",
"* [Ancient World Mapping Centre, road shapefiles](http://awmc.unc.edu/awmc/map_data/shapefiles/ba_roads/), shapefile for roads, following the Barrington Atlas of the Greek and Roman World, under the Creative Commons Attribution-NonCommercial 4.0 International License. Collection of shapefiles also available through the UCD Digital Library\n",
"\n",
"_3. Cities_\n",
"\n",
"The shapefile of the cities used in the map is based on Hanson, J. W. (2016a). Cities Database (OXREP databases). Version 1.0. Accessed (date): http://oxrep.classics.ox.ac.uk/databases/cities/. DOI: https://doi.org/10.5287/bodleian:eqapevAn8. More info available through Hanson, J. W. (2016b). An Urban Geography of the Roman World, 100 B.C. to A.D. 300. Oxford: Archaeopress.\n",
"\n",
"---\n",
"\n",
"\n",
"# How to search for inscriptions\n",
"\n",
"**!!! First of all, go to the *`Kernel menu`* from the top bar of this Notebook and choose `Restart & Run All Cells` for the search interface to load properly.**\n",
@@ -88,18 +61,66 @@
"Once you are happy with your selection, press the **Get inscriptions!** button and wait for the result to show in the window below.\n",
"\n",
"### Download the results\n",
"\n",
"When the search is done, the total number of inscriptions found will show in the window below, together with two links to download the data in a **TSV file** and a **JSON file** format. Click on the format of your choice to download the file to your local computer. You can easily open a TSV file with your favourite spreadsheet software. For working with the JSON file we recommend using either R or Python.\n",
"\n",
"Note that the *file name* in both formats contains the date of your search, source of the data and how it was accessed (EDCS via *LatEpig*) and any search parametres or their combinations you have used (*Term 1, Term 2, Dating from*...) and the number of inscriptions found. This way you will always remember what you have searched for and when you share the file with a colleague or students, they can always replicate your search to see if any new inscriptions were added to EDCS.\n",
"\n",
"#### Metadata for the downloaded files\n",
"---\n",
"\n",
"## Data sources\n",
"\n",
"### Inscriptions\n",
"The Epigraphik-Datenbank Clauss / Slaby (EDCS) is a digital collection of more than 500,000 Latin inscriptions, created by Prof. Manfred Clauss, Anne Kolb, Wolfgang A. Slaby, Barbara Woitas, and hosted by the Universität Zürich and Katholische Universität Eichstätt-Ingolstadt. For more information see [www.manfredclauss.de](http://www.manfredclauss.de/).\n",
"\n",
"Each file contains the information from EDCS separated into 22 attributes. The [LatEpig Metadata description](https://github.com/mqAncientHistory/Lat-Epig/LatEpig_Metadata_Description.txt) document in the Github repo describes the contents of individual attributes along with their description and original source.\n",
"### Interactive Map\n",
"_1. Roman Empire Boundaries & Provinces_\n",
"\n",
"[Ancient World Mapping Centre, political shading shapefiles](http://awmc.unc.edu/awmc/map_data/shapefiles/cultural_data/political_shading/), following the Barrington Atlas of the Greek and Roman World, AWMC.UNC.EDU, under the Creative Commons Attribution-NonCommercial 4.0 International License.\n",
"\n",
"* Roman Empire 60 BC (provinces or extent)\n",
"* Roman Empire in AD 14 (provinces or extent)\n",
"* Roman Empire in AD 69 (provinces or extent)\n",
"* Roman Empire in AD 117 (DEFAULT, provinces or extent)\n",
"* Roman Empire in AD 200 (provinces or extent)\n",
"\n",
"_2. Roman Roads_\n",
"* McCormick, M. et al. 2013. \"Roman Road Network (version 2008),\" DARMC Scholarly Data Series, Data Contribution Series #2013-5. DARMC, Center for Geographic Analysis, Harvard University, Cambridge MA 02138.\n",
"\n",
"* [Ancient World Mapping Centre, road shapefiles](http://awmc.unc.edu/awmc/map_data/shapefiles/ba_roads/), shapefile for roads, following the Barrington Atlas of the Greek and Roman World, under the Creative Commons Attribution-NonCommercial 4.0 International License. Collection of shapefiles also available through the UCD Digital Library\n",
"\n",
"_3. Cities_\n",
"\n",
"The shapefile of the cities used in the map is based on Hanson, J. W. (2016a). Cities Database (OXREP databases). Version 1.0. Accessed (date): http://oxrep.classics.ox.ac.uk/databases/cities/. DOI: https://doi.org/10.5287/bodleian:eqapevAn8. More info available through Hanson, J. W. (2016b). An Urban Geography of the Roman World, 100 B.C. to A.D. 300. Oxford: Archaeopress.\n",
"\n",
"\n",
"### Metadata for the files produced by *LatEpig*\n",
"\n",
"Each TSV and JSON file contains the information from EDCS separated into 22 attributes. The [*LatEpig* Metadata description](https://github.com/mqAncientHistory/Lat-Epig/blob/main/LatEpig_Metadata_Description.txt) document in the GitHub repo describes the contents of individual attributes along with their description and their original source. \n",
"\n",
"Note that the *file name* in both formats contains the date of your search, the source of the data and how it was accessed (EDCS via *LatEpig*) and any search parameters or their combinations you have used (*Term 1, Term 2, Dating from*...) and the number of inscriptions found. This way you will always remember what you have searched for and when you share the file with a colleague or students, they can always replicate your search to see if any new inscriptions were added to EDCS. The same applies to the publication quality maps (experimental feature) produced by *LatEpig*: they all contain the search parameters by default, alongside the origin of information and credit - all in the spirit of the best research practice and FAIR data principles.\n",
"\n",
"\n",
"### Generation of new attributes\n",
"\n",
"#### Customised cleaning of the epigraphic text and unit testing\n",
"\n",
"The text of the inscription is available in three different formats as three separate attributes: \n",
"1. ‘*inscription*’ - the original text as presented by EDCS with all original markup and symbols, including the Leiden Conventions markup for editions of inscriptions; \n",
"2. ‘*inscription_conservative_cleaning*’ - the result of the custom cleaning function embedded in the Lat/Epig software, producing a conservative version of the text of an inscription. The text is as close to the preserved state of the text, without restorations and expansions also known as the diplomatic edition (only the characters as they appear on the support, with minimal or no editorial intervention or interpretation)\n",
"3. ‘*inscription_interpretive_cleaning*’ - the result of the custom cleaning function embedded in the Lat/Epig software, producing an interpretative version of the text of an inscription. The text contains all restorations and expansions to obtain as rich a version of the text as possible, interpunction between sentences is not preserved. This text version is most suitable for quantitative text analysis methods and NLP.\n",
"\n",
"For details, see the structure of [both cleaning functions](https://github.com/mqAncientHistory/Lat-Epig/blob/main/src/lat_epig/text_parse.py) and [their unit tests](https://github.com/mqAncientHistory/Lat-Epig/blob/main/src/lat_epig/test_inscriptions.py).\n",
"\n",
"#### Unit tests\n",
"- for selected attributes, and overall functionality: [dates](https://github.com/mqAncientHistory/Lat-Epig/blob/main/src/lat_epig/test_dates.py), [language](https://github.com/mqAncientHistory/Lat-Epig/blob/main/src/lat_epig/test_language.py), [data access](https://github.com/mqAncientHistory/Lat-Epig/blob/main/src/lat_epig/test_scrape.py).\n",
"\n",
"---\n",
"\n",
"We welcome any feedback via the Github <a class=\"github-button\" href=\"https://github.com/mqAncientHistory/Lat-Epig/issues\" data-color-scheme=\"no-preference: light; light: light; dark: dark;\" data-icon=\"octicon-issue-opened\" data-show-count=\"true\" aria-label=\"Issue mqAncientHistory/Lat-Epig on GitHub\">Issue</a>!\n",
"\n",
"**Happy epigraphic explorations!**\n"
"\n",
"### Happy epigraphic explorations!\n"
]
},
{
@@ -511,9 +532,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}
8 changes: 5 additions & 3 deletions README.md
@@ -25,6 +25,8 @@ This program allows to extraction of the output of a search query from the [Epig

**Ballsun-Stanton B., Heřmánková P., Laurence R. *LatEpig* (version 2.0). GitHub. URL: <https://github.com/mqAncientHistory/Lat-Epig/> DOI: [10.5281/zenodo.5211341](https://doi.org/10.5281/zenodo.5211341)**

**License**: [GNU General Public License v3.0](https://github.com/mqAncientHistory/Lat-Epig/blob/main/LICENSE)

If you're using this tool in your research, <!-- Place this tag where you want the button to render. -->
<a class="github-button" href="https://github.com/mqAncientHistory/Lat-Epig" data-color-scheme="no-preference: light; light: light; dark: dark;" data-icon="octicon-star" data-show-count="true" aria-label="Star mqAncientHistory/Lat-Epig on GitHub">Star</a> us on Github! (This way, we don't need to put tracking pixels into this notebook to get a sense of how many folks are using our tool!)

@@ -125,9 +127,9 @@ Note that the *file name* in both formats contains the date of your search, the
#### Customised cleaning of the epigraphic text and unit testing

The text of the inscription is available in three different formats as three separate attributes:
1. ‘inscription’ - the original text as presented by EDCS with all original markup and symbols, including the Leiden Conventions markup for editions of inscriptions;
2. ‘inscription_conservative_cleaning’ - the result of the custom cleaning function embedded in the Lat/Epig software, producing a conservative version of the text of an inscription. The text is as close to the preserved state of the text, without restorations and expansions also known as the diplomatic edition (only the characters as they appear on the support, with minimal or no editorial intervention or interpretation)
3. ‘inscription_interpretive_cleaning’ - the result of the custom cleaning function embedded in the Lat/Epig software, producing an interpretative version of the text of an inscription. The text contains all restorations and expansions to obtain as rich a version of the text as possible, interpunction between sentences is not preserved. This text version is most suitable for quantitative text analysis methods and NLP.
1. ‘*inscription*’ - the original text as presented by EDCS with all original markup and symbols, including the Leiden Conventions markup for editions of inscriptions;
2. ‘*inscription_conservative_cleaning*’ - the result of the custom cleaning function embedded in the Lat/Epig software, producing a conservative version of the text of an inscription. The text is as close to the preserved state of the text, without restorations and expansions also known as the diplomatic edition (only the characters as they appear on the support, with minimal or no editorial intervention or interpretation)
3. ‘*inscription_interpretive_cleaning*’ - the result of the custom cleaning function embedded in the Lat/Epig software, producing an interpretative version of the text of an inscription. The text contains all restorations and expansions to obtain as rich a version of the text as possible, interpunction between sentences is not preserved. This text version is most suitable for quantitative text analysis methods and NLP.

For details, see the structure of [both cleaning functions](https://github.com/mqAncientHistory/Lat-Epig/blob/main/src/lat_epig/text_parse.py) and [their unit tests](https://github.com/mqAncientHistory/Lat-Epig/blob/main/src/lat_epig/test_inscriptions.py).
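
As a rough illustration of how the three text attributes might be used together, the sketch below reads a tab-separated export with Python's standard `csv` module and pulls out the three versions of each inscription. Only the three column names come from the description above; the sample row, the `SAMPLE_TSV` string, and the `read_text_versions` helper are invented for this example and are not actual *LatEpig* output or part of its code.

```python
import csv
import io

# The three text attributes described above.
TEXT_COLUMNS = (
    "inscription",
    "inscription_conservative_cleaning",
    "inscription_interpretive_cleaning",
)

# Hypothetical stand-in for a downloaded TSV file. The row values are
# made up for illustration and were NOT produced by the cleaning functions.
SAMPLE_TSV = (
    "\t".join(TEXT_COLUMNS) + "\n"
    "D(is) M(anibus) [s]acrum\tD M acrum\tDis Manibus sacrum\n"
)


def read_text_versions(handle):
    """Return one dict per inscription holding only the three text columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{col: row[col] for col in TEXT_COLUMNS} for row in reader]


rows = read_text_versions(io.StringIO(SAMPLE_TSV))
for row in rows:
    # The interpretive version is the one suggested for quantitative analysis,
    # so a simple whitespace tokenisation is applied to that column only.
    tokens = row["inscription_interpretive_cleaning"].split()
    print(len(tokens), tokens)
```

To try this on a real export, replace `io.StringIO(SAMPLE_TSV)` with an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded TSV file.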


Large diffs are not rendered by default.
