Scraper setup script #146

Merged: 106 commits, Aug 26, 2021
Commits
5098c7f
update list_pdf readme
CaptainStabs Jul 1, 2021
bbed322
update documentation
CaptainStabs Jul 1, 2021
984dc2b
this will be moved later
CaptainStabs Jul 1, 2021
51ce36a
moved to common/gui
CaptainStabs Jul 1, 2021
9e9a571
moved and renamed to scraper_ui
CaptainStabs Jul 1, 2021
01f34a5
generated code
CaptainStabs Jul 1, 2021
2b5db35
update size of combobox
CaptainStabs Jul 1, 2021
0d0c950
saving changes
CaptainStabs Jul 2, 2021
92eccdd
add error_modal
CaptainStabs Jul 2, 2021
d545531
pushing for eric
CaptainStabs Jul 2, 2021
4403e68
update ui
CaptainStabs Jul 2, 2021
4301b4c
add a few comments
CaptainStabs Jul 2, 2021
e5129d0
apply partial pep8 formatting
CaptainStabs Jul 2, 2021
9aaa9b3
fix backwards english logic
CaptainStabs Jul 2, 2021
3614201
Working PoC
CaptainStabs Jul 2, 2021
be9a8e2
Compiled version of scraper setup
CaptainStabs Jul 2, 2021
04bd101
Changed to edit the `configs` dictionary instead of using deprecated …
CaptainStabs Jul 3, 2021
f0dc8e6
remove old method of creating scarpers
CaptainStabs Jul 3, 2021
ab1fa38
Accidentally removed except statement
CaptainStabs Jul 3, 2021
59dfc47
update requirements
CaptainStabs Jul 3, 2021
b3e5c39
updated scrapersetup
CaptainStabs Jul 3, 2021
dd9aed7
update
CaptainStabs Jul 3, 2021
c41a40c
upload
CaptainStabs Jul 3, 2021
56837b5
merge into original scripts
CaptainStabs Jul 3, 2021
1f97ab5
removed testing stuff
CaptainStabs Aug 6, 2021
e6c1a6f
fix a lot of issues with the code
CaptainStabs Aug 7, 2021
627d72e
no longer needed
CaptainStabs Aug 7, 2021
66ba201
sanitize user input via .lower()
CaptainStabs Aug 7, 2021
af361ae
fix v3 not writing non_important
CaptainStabs Aug 7, 2021
77a00e9
make `choose_scraper_button` enable the setup page and switch to it.
CaptainStabs Aug 7, 2021
938e46b
Disable other scraperr's tabs on choose different type
CaptainStabs Aug 7, 2021
ad30d85
Initial commit
josh-chamberlain Aug 7, 2021
3c51812
sanitize CG input by lower()
CaptainStabs Aug 7, 2021
9b7db9b
Merge branch 'main' of https://github.com/Police-Data-Accessibility-P…
CaptainStabs Aug 8, 2021
a933fe2
apparently i wrote readmes for this branch as well
CaptainStabs Aug 8, 2021
eb8c9e6
working file version
CaptainStabs Aug 8, 2021
44e3832
merge scraper_setup0.py into scraper_setup.py
CaptainStabs Aug 8, 2021
0b5c226
no longer needed due to merge
CaptainStabs Aug 8, 2021
f0bdd05
merge v2 into scraper_ui.ui
CaptainStabs Aug 8, 2021
f3b91b5
old compile
CaptainStabs Aug 8, 2021
91019bb
use normal ui file
CaptainStabs Aug 8, 2021
8dd21cd
this was renamed ages ago in the repo
CaptainStabs Aug 8, 2021
bf886a2
fix sleep_time settings
CaptainStabs Aug 9, 2021
c60197e
country should be upper
CaptainStabs Aug 9, 2021
98ae210
fix typo causing duplicate of country
CaptainStabs Aug 9, 2021
1654a64
no need to lower it again
CaptainStabs Aug 9, 2021
b1b22bc
add commit
CaptainStabs Aug 9, 2021
3fb53ce
fix writing error
CaptainStabs Aug 9, 2021
491d358
add stuff
CaptainStabs Aug 9, 2021
6c927d8
move from pdap-scrapers
CaptainStabs Aug 9, 2021
faa64f6
move
CaptainStabs Aug 9, 2021
dc4cfd2
update readme
CaptainStabs Aug 9, 2021
b505a94
compile script
CaptainStabs Aug 9, 2021
1dac077
generate requirements
CaptainStabs Aug 9, 2021
3ab6157
remove unneeded dependencies
CaptainStabs Aug 10, 2021
9fc928e
copy out scrapersetup_windows
CaptainStabs Aug 10, 2021
a269313
rename to ScraperSetup
CaptainStabs Aug 10, 2021
221d347
not needed
CaptainStabs Aug 10, 2021
53402e5
Moved out, will become a release
CaptainStabs Aug 10, 2021
cb0c1b0
add version label in gui
CaptainStabs Aug 10, 2021
950ddcb
update readme
CaptainStabs Aug 10, 2021
3bb5d44
move to setup_gui folder
CaptainStabs Aug 10, 2021
bca383f
moved
CaptainStabs Aug 10, 2021
f9c1602
Merge branch 'main' of https://github.com/Police-Data-Accessibility-P…
CaptainStabs Aug 10, 2021
efec110
Merge remote-tracking branch 'origin/scraper-setup-script'
CaptainStabs Aug 10, 2021
b9216e7
Merge pull request #145 from Police-Data-Accessibility-Project/main-h…
CaptainStabs Aug 10, 2021
47e52f5
Accidentally used the common scripts instead of the base_scripts
CaptainStabs Aug 10, 2021
45a40cf
Update README.md
josh-chamberlain Aug 12, 2021
a315c89
Add comment denoting the scripts being auto-created
CaptainStabs Aug 12, 2021
35fe97b
this was renamed a long time ago
CaptainStabs Aug 12, 2021
26e3d14
Merge branch 'scraper-setup-script' of https://github.com/Police-Data…
CaptainStabs Aug 12, 2021
bdf9982
Fix for issue #148, added more comments
CaptainStabs Aug 12, 2021
4252851
Add scraper modal code
CaptainStabs Aug 12, 2021
9cc04a0
yeah this isn't working...
CaptainStabs Aug 12, 2021
4971c21
Comment out successdialog
CaptainStabs Aug 12, 2021
0801d7f
No need to have two
CaptainStabs Aug 12, 2021
f14c954
bump version
CaptainStabs Aug 12, 2021
7959995
fix comment
CaptainStabs Aug 12, 2021
63d7a52
rename scraper_setup to ScraperSetup
CaptainStabs Aug 12, 2021
70f74b3
add jmespath to import, add get_agency_info for future use, add (more…
CaptainStabs Aug 13, 2021
006c036
temporary for reference
CaptainStabs Aug 13, 2021
ed462b1
add schema stuff
CaptainStabs Aug 13, 2021
58698f5
add schema logic
CaptainStabs Aug 13, 2021
48c3f73
Add support for high resolution screens.
CaptainStabs Aug 16, 2021
3790b11
Fix scaling issues on high DPI/resolution screens
CaptainStabs Aug 16, 2021
b1f12ee
Fix labels overlaping on high DPI displays
CaptainStabs Aug 16, 2021
bb293a4
move to the `setup_gui` folder
CaptainStabs Aug 16, 2021
b4ed698
add better json search
CaptainStabs Aug 16, 2021
21560b6
Make dolt request actually fill out table
CaptainStabs Aug 16, 2021
ffa7dc6
enable all other create tabs to switch to the schema tab
CaptainStabs Aug 16, 2021
b5709ca
add part 2 of schema stuff, error messages
CaptainStabs Aug 16, 2021
015858a
update error messages, force stylesheet reset after every create func…
CaptainStabs Aug 16, 2021
0eb27c5
update qtablewidget
CaptainStabs Aug 16, 2021
c33d7ed
globalize scraper_save_dir
CaptainStabs Aug 16, 2021
93b5a90
add for eventual copying
CaptainStabs Aug 16, 2021
62b2c6e
add for push
CaptainStabs Aug 16, 2021
0cd2966
not entirely sure what this is changing
CaptainStabs Aug 16, 2021
f652306
felt cute, might delete later
CaptainStabs Aug 22, 2021
7609b13
add/fix copying schema files, add schema edit code
CaptainStabs Aug 22, 2021
bada422
Merge branch 'main' into scraper-setup-script
CaptainStabs Aug 22, 2021
2a74810
add `scraper_path` to setup_gui's copy of the schema
CaptainStabs Aug 22, 2021
7452188
schema creation is complete
CaptainStabs Aug 22, 2021
a2da77f
add value to `scraper_path` to prevent errors
CaptainStabs Aug 22, 2021
4db5c01
decided to deletee
CaptainStabs Aug 22, 2021
2058e35
Bump version
CaptainStabs Aug 22, 2021
5e06939
Update readme to reflect new names
CaptainStabs Aug 22, 2021
11 changes: 8 additions & 3 deletions Base_Scripts/Scrapers/list_pdf_extractors/list_pdf_v2.py
@@ -9,19 +9,23 @@
 """
 SETUP HOW-TO:
 Step 1: Set webpage to the page you want to scrape.
-Step 2: Click the links that lead to the files, and copy their paths. **NOTE:** Ensure that files all match paths, otherwise remove a level until they match.
+Step 2: Click the links that lead to the files, and copy their paths.
+For example, http://www.beverlyhills.org/cbhfiles/storage/files/long_num/file.pdf would become /cbhfiles/storage/files/long_num/
+**NOTE:** Ensure that files all match paths, otherwise remove a level until they match.
 Also ensure that domain stays the same (I've seen some sites use AWS buckets for one file and an on-site storage method for another)
-Verify on page that the href to the file contains the domain, if it doesn't, add the domain to domain.
+Verify* on page that the href to the file contains the domain, if it doesn't, add the domain to domain.
 Step 3: If the domain is not in the href, set domain_included to False, otherwise set it to True
 Step 4: If you set domain_included to False, you need to add the domain (from the http(s) to the top level domain (TLD) (.com, .edu, etc),
 otherwise, you can leave it blank.
 Step 5: Set sleep_time to the desired integer. Best practice is to set it to the crawl-delay in a website's `robots.txt`.
-Most departments do not seem to have a crawl-delay specified, so leave it at 5.
+Most departments do not seem to have a crawl-delay specified, so leave it at 5 (If it's not there).
 Step 6: (Only applies to list_pdf_v3) If there are any documents that you *don't* want to scrape from the page,
 put the words that are **unique** to them.
 Step 7: "debug" will make the scraper more verbose, but will generally be unhelpful to the average user. Leave False unless you're having issues.
 "csv_dir" is better explained in the readme.

+\* Verify this using your browser's developer pane using select element AKA Node Select
+
 EXAMPLE CONFIG:
 configs = {
 "webpage": "http://www.beverlyhills.org/departments/policedepartment/crimeinformation/crimestatistics/web.jsp",

@@ -35,6 +39,7 @@
 }
 """

+
 configs = {
 "webpage": "",
 "web_path": "",
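For illustration, a fully filled-in configs for list_pdf_v2 might look like the sketch below, using the Beverly Hills example from the docstring above. Every key besides `webpage` and `web_path` is inferred from the setup steps, and all values are hypothetical:

```python
# Illustrative values only: key names beyond "webpage"/"web_path" are
# inferred from the setup steps above, not copied from the real template.
configs = {
    "webpage": "http://www.beverlyhills.org/departments/policedepartment/crimeinformation/crimestatistics/web.jsp",
    "web_path": "/cbhfiles/storage/files/long_num/",  # common path shared by the file links
    "domain_included": False,                 # the hrefs hold bare paths, not full URLs...
    "domain": "http://www.beverlyhills.org",  # ...so the domain must be supplied (Step 4)
    "sleep_time": 5,        # robots.txt crawl-delay, or 5 if none is specified
    "debug": False,         # leave False unless you're having issues
    "csv_dir": "",          # better explained in the readme
}
```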
8 changes: 5 additions & 3 deletions Base_Scripts/Scrapers/list_pdf_extractors/list_pdf_v3.py
@@ -9,14 +9,16 @@
 """
 SETUP HOW-TO:
 Step 1: Set webpage to the page you want to scrape.
-Step 2: Click the links that lead to the files, and copy their paths. **NOTE:** Ensure that files all match paths, otherwise remove a level until they match.
+Step 2: Click the links that lead to the files, and copy their paths.
+For example, http://www.beverlyhills.org/cbhfiles/storage/files/long_num/file.pdf would become /cbhfiles/storage/files/long_num/
+**NOTE:** Ensure that files all match paths, otherwise remove a level until they match.
 Also ensure that domain stays the same (I've seen some sites use AWS buckets for one file and an on-site storage method for another)
-Verify on page that the href to the file contains the domain, if it doesn't, add the domain to domain.
+Verify* on page that the href to the file contains the domain, if it doesn't, add the domain to domain.
 Step 3: If the domain is not in the href, set domain_included to False, otherwise set it to True
 Step 4: If you set domain_included to False, you need to add the domain (from the http(s) to the top level domain (TLD) (.com, .edu, etc),
 otherwise, you can leave it blank.
 Step 5: Set sleep_time to the desired integer. Best practice is to set it to the crawl-delay in a website's `robots.txt`.
-Most departments do not seem to have a crawl-delay specified, so leave it at 5.
+Most departments do not seem to have a crawl-delay specified, so leave it at 5 (If it's not there).
 Step 6: (Only applies to list_pdf_v3) If there are any documents that you *don't* want to scrape from the page,
 put the words that are **unique** to them.
 Step 7: "debug" will make the scraper more verbose, but will generally be unhelpful to the average user. Leave False unless you're having issues.
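For the v3-only filtering from Step 6, a hedged sketch follows; the `non_important` key name appears in this PR's commits and readme, but the list format and keyword values are assumptions:

```python
# v3 only: skip documents whose links contain these (unique) keywords.
# The key name "non_important" comes from this PR; the list shape is assumed.
configs = {
    "webpage": "",
    "web_path": "",
    "domain_included": True,
    "domain": "",
    "sleep_time": 5,
    "non_important": ["agenda", "minutes"],  # words unique to unwanted documents
    "debug": False,
    "csv_dir": "",
}
```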
@@ -0,0 +1,13 @@
# Setup

Within `configs.py`:
1. Set `url` to the page's url
1. Set `department_code` to the first few letters of the url, all capitalized. For example, the `department_code` of `https://hsupd.crimegraphics.com/2013/default.aspx` would be `HSUPD`.
1. `list_header` shouldn't need any changing, as it just translates the columns into our `Fields`

# Module

The `crimegraphics_scraper` module requires two arguments: `configs` and `save_dir`. Should you want performance stats, add `stats=True` as an argument.
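A minimal usage sketch; only the signature `crimegraphics_scraper(configs, save_dir, stats=True)` comes from this readme, while the import paths and `save_dir` value are assumptions:

```python
# Sketch only: the import paths and save_dir value are assumptions.
from crimegraphics_scraper import crimegraphics_scraper
from configs import configs  # the configs.py edited above

save_dir = "./data/"  # wherever the scraped tables should be written
crimegraphics_scraper(configs, save_dir, stats=True)  # stats=True is optional
```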

# Info
The scripts should likely be run daily. They will only save the data if the hash (generated from the table) differs from the last run's; otherwise, they simply exit.
76 changes: 76 additions & 0 deletions common/base_scrapers/list_pdf_scrapers/list_pdf_readme.md
@@ -0,0 +1,76 @@
# Setup

1. Clone the repo, either via the command line (`git clone https://github.com/Police-Data-Accessibility-Project/Scrapers.git`) or from the website.
2. `cd` into the `Scrapers` folder and run `pip3 install -r requirements.txt`
3. Copy the extractor version you need, along with `configs.py`, into the `COUNTRY/STATE/COUNTY` folder that you created for the precinct. For example, Alameda County, California, would go in `Scrapers/USA/CA/alameda/`.

This **MUST** be placed within the `Scrapers` folder that you downloaded. See [here](https://github.com/Police-Data-Accessibility-Project/Scrapers/tree/master/USA/CA/alameda) for the example.

Open the `configs.py` file that you copied:
1. Set `webpage` to the page with the pdf lists

2. Open a few PDFs, note the file path they have in common, and set that as `web_path`

3. Set the `domain` to the beginning of the document host.

4. On the page you want to scrape, open your browser's inspector and, using "Select an element", click the link to the PDF (once), then look at the element pane.

If the `href` tag looks like the following (without the domain, just a path), add the common portion of the path. In this case, it's `/Portals/24/Booking Log/`. (Spaces *should* be properly dealt with in the script, but if not, just replace them with `%20`.)


![image](https://user-images.githubusercontent.com/40151222/113303191-d5093200-92ce-11eb-8e42-0c23f70d9f47.png)

Also, if the `href` tag does not have a slash in front of it, as in the following picture, please add one.

![image](https://user-images.githubusercontent.com/40151222/113487408-ffe9b680-9485-11eb-8942-b08fa7c1e528.png)

Make sure to add a slash to the end of the `domain`.
For example, `domain = "https://www.website.com"` would become `domain = "https://www.website.com/"`


If the site sets a crawl-delay in its `robots.txt`, set `sleep_time` to that value. Otherwise, just leave it at `5`.
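A quick, standard-library way to check for a crawl-delay (a sketch; `www.website.com` is a placeholder):

```python
# Check robots.txt for a Crawl-delay before picking sleep_time.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.website.com/robots.txt")
rp.read()
delay = rp.crawl_delay("*")              # None when no Crawl-delay is specified
sleep_time = int(delay) if delay else 5  # fall back to the default of 5
```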

If this does not make sense, try checking the comments within the code (if you can find any).
Working example can be found [here](https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/blob/master/USA/CA/fresno_county/college/fresno/fresno_daily_scraper.py)

# Versions:
`list_pdf_extractor.py` : the most basic of the scripts, mostly used for reference

`list_pdf_extractor_v2.py` : Uses imported `get_files` function. Useful for cases where a custom `get_files` is **not** needed. Function can be found [here](https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/blob/master/common/utils/list_pdf_utils/get_files.py)

`list_pdf_extractor_v3.py` : Built off of v2; allows filtering links by common unwanted words. ~~See [golden_west_scraper.py](https://github.com/CaptainStabs/Scrapers/blob/master/USA/CA/golden_west_college/golden_west_scraper.py) for a working example.~~


This script has two functions. The first, `extract_info`, extracts the links containing documents and saves each URL and document name to a file called `links.txt`.
The second, `get_files`, reads the links and names from `links.txt` and downloads the files.

#### Arguments:

As the `list_pdf_scrapers` all use a common module, they accept the same arguments (a usage sketch follows the V3-specific arguments below).

* `configs` : Required - comes with the template script, so no need to worry about it.
* `save_dir` : Required - comes with the template script, so no need to worry about it.

* `flavor` : Optional - Defaults to `stream`; only used when `extract_tables` is True. Accepted values are `stream` and `lattice`. Useful if the extracted data is jumbled (may not fix everything though).
* `extract_tables` : Optional - Defaults to False; if set to True, will attempt to extract tables from pdfs using [Camelot](https://camelot-py.readthedocs.io/en/master/).

The following 5 arguments are all passed to the `get_files` module. Its readme is located [here](https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/blob/master/common/utils/list_pdf_utils/get_files_README.md)

* `name_in_url` : Optional - Defaults to True; as the name implies, set this to False if the document name is **NOT** in the url/path or in the `href`

* `extract_name` : Optional - Defaults to False; A different method of getting the document's name. Use if setting `name_in_url` to False did not work.

Example of when to use: the `url_name.txt` file simply has something like `documentID?5311351`. Only works if the `href` contains a string, like `<a href="/DocumentID?">Annual Report 2020</a>`. (This is, of course, a highly simplified tag; there will be a lot more clutter on actual websites.)
* `add_date` : Optional - Defaults to False; use if a document is simply overwritten on a website without its name being changed. Used in conjunction with `no_overwrite`
* `try_overwite` : Optional - Defaults to False; mostly deprecated, check with a Director before using. Use `no_overwrite` instead
* `no_overwrite` : Optional - Defaults to False; Replaces `try_overwite`. Use in conjunction with `add_date`. As the name suggests, it prevents older documents from being overwritten, while still saving the new one if there are changes.

##### Arguments unique to V3

* `delete` : Optional - Defaults to True; if set to False, `url_name.txt` will not be deleted. This argument is also passed to `get_files`, since `get_files` deletes the file once it is done with it.
* `important` : Optional - Defaults to False; if there are more files that you *don't* want than ones you do want, set this to True to filter the other way and keep only items containing the keywords. If set to True, rename `non_important` in the configs to `important` (it will work either way, but not without complaining if it can't find `important`)
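As referenced above, a hedged invocation sketch: the entry-point name and import path are assumptions (the template script normally wires up `configs` and `save_dir` for you), but the keyword arguments are the ones documented in this readme:

```python
# Hypothetical invocation: the function name and import path are assumed;
# only the keyword arguments below are documented in this readme.
from list_pdf_v3 import list_pdf_v3  # assumed import path and name
from configs import configs

save_dir = "./data/"  # assumed location
list_pdf_v3(
    configs,
    save_dir,
    extract_tables=True,  # extract tables from the PDFs with Camelot
    flavor="lattice",     # try this if the default "stream" output is jumbled
    add_date=True,        # date-stamp documents that get overwritten upstream
    no_overwrite=True,    # keep older copies instead of clobbering them
    delete=False,         # v3 only: keep url_name.txt for inspection
)
```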


# More in-depth explanations (Poorly explained, nerdy stuff)
`extract_info` uses `urllib` to open the webpage, and then `BeautifulSoup4` to parse it. It then uses regex to find all links that end with pdf or doc. A few lines still need to be replaced with regex.
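A simplified sketch of that flow (the real script's parsing details and `links.txt` format may differ):

```python
# urllib fetches the page, BeautifulSoup4 parses it, regex keeps pdf/doc links.
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

def extract_info(webpage, domain=""):
    soup = BeautifulSoup(urlopen(webpage).read(), "html.parser")
    with open("links.txt", "w") as f:
        for a in soup.find_all("a", href=True):
            href = a["href"]
            if re.search(r"\.(pdf|docx?)$", href, re.IGNORECASE):
                # Prepend the domain when the href is a bare path (Step 3/4).
                url = href if href.startswith("http") else domain + href
                f.write(f"{url}, {url.rsplit('/', 1)[-1]}\n")
```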
52 changes: 52 additions & 0 deletions common/gui/error_modal.py
@@ -0,0 +1,52 @@
# -*- coding: utf-8 -*-

# Form implementation generated from reading ui file './common/gui/error_modal.ui'
#
# Created by: PyQt5 UI code generator 5.15.4
#
# WARNING: Any manual changes made to this file will be lost when pyuic5 is
# run again. Do not edit this file unless you know what you are doing.


from PyQt5 import QtCore, QtGui, QtWidgets


class Ui_error_dialog(object):
def setupUi(self, error_dialog):
error_dialog.setObjectName("error_dialog")
error_dialog.setWindowModality(QtCore.Qt.ApplicationModal)
error_dialog.resize(400, 300)
error_dialog.setModal(True)
self.buttonBox = QtWidgets.QDialogButtonBox(error_dialog)
self.buttonBox.setGeometry(QtCore.QRect(30, 240, 341, 32))
self.buttonBox.setOrientation(QtCore.Qt.Horizontal)
self.buttonBox.setStandardButtons(QtWidgets.QDialogButtonBox.Cancel|QtWidgets.QDialogButtonBox.Ok)
self.buttonBox.setObjectName("buttonBox")
self.label = QtWidgets.QLabel(error_dialog)
self.label.setGeometry(QtCore.QRect(70, 90, 271, 51))
font = QtGui.QFont()
font.setPointSize(13)
self.label.setFont(font)
self.label.setAutoFillBackground(True)
self.label.setScaledContents(True)
self.label.setWordWrap(True)
self.label.setObjectName("label")
self.label_2 = QtWidgets.QLabel(error_dialog)
self.label_2.setGeometry(QtCore.QRect(30, 40, 221, 61))
font = QtGui.QFont()
font.setPointSize(20)
font.setBold(True)
font.setWeight(75)
self.label_2.setFont(font)
self.label_2.setObjectName("label_2")

self.retranslateUi(error_dialog)
self.buttonBox.accepted.connect(error_dialog.accept)
self.buttonBox.rejected.connect(error_dialog.reject)
QtCore.QMetaObject.connectSlotsByName(error_dialog)

def retranslateUi(self, error_dialog):
_translate = QtCore.QCoreApplication.translate
error_dialog.setWindowTitle(_translate("error_dialog", "Dialog"))
self.label.setText(_translate("error_dialog", "You need to complete the first menu first"))
self.label_2.setText(_translate("error_dialog", "ERROR:"))
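For reference, the usual pattern for showing a pyuic5-generated dialog like this one (a sketch; the import path is assumed):

```python
# Standard usage of a pyuic5-generated dialog class.
import sys
from PyQt5 import QtWidgets
from common.gui.error_modal import Ui_error_dialog  # import path assumed

app = QtWidgets.QApplication(sys.argv)
dialog = QtWidgets.QDialog()
ui = Ui_error_dialog()
ui.setupUi(dialog)
dialog.exec_()  # blocks until the user clicks OK or Cancel
```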
121 changes: 121 additions & 0 deletions common/gui/error_modal.ui
@@ -0,0 +1,121 @@
<?xml version="1.0" encoding="UTF-8"?>
<ui version="4.0">
<class>error_dialog</class>
<widget class="QDialog" name="error_dialog">
<property name="windowModality">
<enum>Qt::ApplicationModal</enum>
</property>
<property name="geometry">
<rect>
<x>0</x>
<y>0</y>
<width>400</width>
<height>300</height>
</rect>
</property>
<property name="windowTitle">
<string>Dialog</string>
</property>
<property name="modal">
<bool>true</bool>
</property>
<widget class="QDialogButtonBox" name="buttonBox">
<property name="geometry">
<rect>
<x>30</x>
<y>240</y>
<width>341</width>
<height>32</height>
</rect>
</property>
<property name="orientation">
<enum>Qt::Horizontal</enum>
</property>
<property name="standardButtons">
<set>QDialogButtonBox::Cancel|QDialogButtonBox::Ok</set>
</property>
</widget>
<widget class="QLabel" name="label">
<property name="geometry">
<rect>
<x>70</x>
<y>90</y>
<width>271</width>
<height>51</height>
</rect>
</property>
<property name="font">
<font>
<pointsize>13</pointsize>
</font>
</property>
<property name="autoFillBackground">
<bool>true</bool>
</property>
<property name="text">
<string>You need to complete the first menu first</string>
</property>
<property name="scaledContents">
<bool>true</bool>
</property>
<property name="wordWrap">
<bool>true</bool>
</property>
</widget>
<widget class="QLabel" name="label_2">
<property name="geometry">
<rect>
<x>30</x>
<y>40</y>
<width>221</width>
<height>61</height>
</rect>
</property>
<property name="font">
<font>
<pointsize>20</pointsize>
<weight>75</weight>
<bold>true</bold>
</font>
</property>
<property name="text">
<string>ERROR:</string>
</property>
</widget>
</widget>
<resources/>
<connections>
<connection>
<sender>buttonBox</sender>
<signal>accepted()</signal>
<receiver>error_dialog</receiver>
<slot>accept()</slot>
<hints>
<hint type="sourcelabel">
<x>248</x>
<y>254</y>
</hint>
<hint type="destinationlabel">
<x>157</x>
<y>274</y>
</hint>
</hints>
</connection>
<connection>
<sender>buttonBox</sender>
<signal>rejected()</signal>
<receiver>error_dialog</receiver>
<slot>reject()</slot>
<hints>
<hint type="sourcelabel">
<x>316</x>
<y>260</y>
</hint>
<hint type="destinationlabel">
<x>286</x>
<y>274</y>
</hint>
</hints>
</connection>
</connections>
</ui>