Scraper setup script #146

Merged: 106 commits, Aug 26, 2021
Commits
5098c7f
update list_pdf readme
CaptainStabs Jul 1, 2021
bbed322
update documentation
CaptainStabs Jul 1, 2021
984dc2b
this will be moved later
CaptainStabs Jul 1, 2021
51ce36a
moved to common/gui
CaptainStabs Jul 1, 2021
9e9a571
moved and renamed to scraper_ui
CaptainStabs Jul 1, 2021
01f34a5
generated code
CaptainStabs Jul 1, 2021
2b5db35
update size of combobox
CaptainStabs Jul 1, 2021
0d0c950
saving changes
CaptainStabs Jul 2, 2021
92eccdd
add error_modal
CaptainStabs Jul 2, 2021
d545531
pushing for eric
CaptainStabs Jul 2, 2021
4403e68
update ui
CaptainStabs Jul 2, 2021
4301b4c
add a few comments
CaptainStabs Jul 2, 2021
e5129d0
apply partial pep8 formatting
CaptainStabs Jul 2, 2021
9aaa9b3
fix backwards english logic
CaptainStabs Jul 2, 2021
3614201
Working PoC
CaptainStabs Jul 2, 2021
be9a8e2
Compiled version of scraper setup
CaptainStabs Jul 2, 2021
04bd101
Changed to edit the `configs` dictionary instead of using deprecated …
CaptainStabs Jul 3, 2021
f0dc8e6
remove old method of creating scarpers
CaptainStabs Jul 3, 2021
ab1fa38
Accidentally removed except statement
CaptainStabs Jul 3, 2021
59dfc47
update requirements
CaptainStabs Jul 3, 2021
b3e5c39
updated scrapersetup
CaptainStabs Jul 3, 2021
dd9aed7
update
CaptainStabs Jul 3, 2021
c41a40c
upload
CaptainStabs Jul 3, 2021
56837b5
merge into original scripts
CaptainStabs Jul 3, 2021
1f97ab5
removed testing stuff
CaptainStabs Aug 6, 2021
e6c1a6f
fix a lot of issues with the code
CaptainStabs Aug 7, 2021
627d72e
no longer needed
CaptainStabs Aug 7, 2021
66ba201
sanitize user input via .lower()
CaptainStabs Aug 7, 2021
af361ae
fix v3 not writing non_important
CaptainStabs Aug 7, 2021
77a00e9
make `choose_scraper_button` enable the setup page and switch to it.
CaptainStabs Aug 7, 2021
938e46b
Disable other scraperr's tabs on choose different type
CaptainStabs Aug 7, 2021
ad30d85
Initial commit
josh-chamberlain Aug 7, 2021
3c51812
sanitize CG input by lower()
CaptainStabs Aug 7, 2021
9b7db9b
Merge branch 'main' of https://github.com/Police-Data-Accessibility-P…
CaptainStabs Aug 8, 2021
a933fe2
apparently i wrote readmes for this branch as well
CaptainStabs Aug 8, 2021
eb8c9e6
working file version
CaptainStabs Aug 8, 2021
44e3832
merge scraper_setup0.py into scraper_setup.py
CaptainStabs Aug 8, 2021
0b5c226
no longer needed due to merge
CaptainStabs Aug 8, 2021
f0bdd05
merge v2 into scraper_ui.ui
CaptainStabs Aug 8, 2021
f3b91b5
old compile
CaptainStabs Aug 8, 2021
91019bb
use normal ui file
CaptainStabs Aug 8, 2021
8dd21cd
this was renamed ages ago in the repo
CaptainStabs Aug 8, 2021
bf886a2
fix sleep_time settings
CaptainStabs Aug 9, 2021
c60197e
country should be upper
CaptainStabs Aug 9, 2021
98ae210
fix typo causing duplicate of country
CaptainStabs Aug 9, 2021
1654a64
no need to lower it again
CaptainStabs Aug 9, 2021
b1b22bc
add commit
CaptainStabs Aug 9, 2021
3fb53ce
fix writing error
CaptainStabs Aug 9, 2021
491d358
add stuff
CaptainStabs Aug 9, 2021
6c927d8
move from pdap-scrapers
CaptainStabs Aug 9, 2021
faa64f6
move
CaptainStabs Aug 9, 2021
dc4cfd2
update readme
CaptainStabs Aug 9, 2021
b505a94
compile script
CaptainStabs Aug 9, 2021
1dac077
generate requirements
CaptainStabs Aug 9, 2021
3ab6157
remove unneeded dependencies
CaptainStabs Aug 10, 2021
9fc928e
copy out scrapersetup_windows
CaptainStabs Aug 10, 2021
a269313
rename to ScraperSetup
CaptainStabs Aug 10, 2021
221d347
not needed
CaptainStabs Aug 10, 2021
53402e5
Moved out, will become a release
CaptainStabs Aug 10, 2021
cb0c1b0
add version label in gui
CaptainStabs Aug 10, 2021
950ddcb
update readme
CaptainStabs Aug 10, 2021
3bb5d44
move to setup_gui folder
CaptainStabs Aug 10, 2021
bca383f
moved
CaptainStabs Aug 10, 2021
f9c1602
Merge branch 'main' of https://github.com/Police-Data-Accessibility-P…
CaptainStabs Aug 10, 2021
efec110
Merge remote-tracking branch 'origin/scraper-setup-script'
CaptainStabs Aug 10, 2021
b9216e7
Merge pull request #145 from Police-Data-Accessibility-Project/main-h…
CaptainStabs Aug 10, 2021
47e52f5
Accidentally used the common scripts instead of the base_scripts
CaptainStabs Aug 10, 2021
45a40cf
Update README.md
josh-chamberlain Aug 12, 2021
a315c89
Add comment denoting the scripts being auto-created
CaptainStabs Aug 12, 2021
35fe97b
this was renamed a long time ago
CaptainStabs Aug 12, 2021
26e3d14
Merge branch 'scraper-setup-script' of https://github.com/Police-Data…
CaptainStabs Aug 12, 2021
bdf9982
Fix for issue #148, added more comments
CaptainStabs Aug 12, 2021
4252851
Add scraper modal code
CaptainStabs Aug 12, 2021
9cc04a0
yeah this isn't working...
CaptainStabs Aug 12, 2021
4971c21
Comment out successdialog
CaptainStabs Aug 12, 2021
0801d7f
No need to have two
CaptainStabs Aug 12, 2021
f14c954
bump version
CaptainStabs Aug 12, 2021
7959995
fix comment
CaptainStabs Aug 12, 2021
63d7a52
rename scraper_setup to ScraperSetup
CaptainStabs Aug 12, 2021
70f74b3
add jmespath to import, add get_agency_info for future use, add (more…
CaptainStabs Aug 13, 2021
006c036
temporary for reference
CaptainStabs Aug 13, 2021
ed462b1
add schema stuff
CaptainStabs Aug 13, 2021
58698f5
add schema logic
CaptainStabs Aug 13, 2021
48c3f73
Add support for high resolution screens.
CaptainStabs Aug 16, 2021
3790b11
Fix scaling issues on high DPI/resolution screens
CaptainStabs Aug 16, 2021
b1f12ee
Fix labels overlaping on high DPI displays
CaptainStabs Aug 16, 2021
bb293a4
move to the `setup_gui` folder
CaptainStabs Aug 16, 2021
b4ed698
add better json search
CaptainStabs Aug 16, 2021
21560b6
Make dolt request actually fill out table
CaptainStabs Aug 16, 2021
ffa7dc6
enable all other create tabs to switch to the schema tab
CaptainStabs Aug 16, 2021
b5709ca
add part 2 of schema stuff, error messages
CaptainStabs Aug 16, 2021
015858a
update error messages, force stylesheet reset after every create func…
CaptainStabs Aug 16, 2021
0eb27c5
update qtablewidget
CaptainStabs Aug 16, 2021
c33d7ed
globalize scraper_save_dir
CaptainStabs Aug 16, 2021
93b5a90
add for eventual copying
CaptainStabs Aug 16, 2021
62b2c6e
add for push
CaptainStabs Aug 16, 2021
0cd2966
not entirely sure what this is changing
CaptainStabs Aug 16, 2021
f652306
felt cute, might delete later
CaptainStabs Aug 22, 2021
7609b13
add/fix copying schema files, add schema edit code
CaptainStabs Aug 22, 2021
bada422
Merge branch 'main' into scraper-setup-script
CaptainStabs Aug 22, 2021
2a74810
add `scraper_path` to setup_gui's copy of the schema
CaptainStabs Aug 22, 2021
7452188
schema creation is complete
CaptainStabs Aug 22, 2021
a2da77f
add value to `scraper_path` to prevent errors
CaptainStabs Aug 22, 2021
4db5c01
decided to deletee
CaptainStabs Aug 22, 2021
2058e35
Bump version
CaptainStabs Aug 22, 2021
5e06939
Update readme to reflect new names
CaptainStabs Aug 22, 2021
11 changes: 8 additions & 3 deletions Base_Scripts/Scrapers/list_pdf_extractors/list_pdf_v2.py
@@ -9,19 +9,23 @@
 """
 SETUP HOW-TO:
 Step 1: Set webpage to the page you want to scrape.
-Step 2: Click the links that lead to the files, and copy their paths. **NOTE:** Ensure that files all match paths, otherwise remove a level until they match.
+Step 2: Click the links that lead to the files, and copy their paths.
+For example, http://www.beverlyhills.org/cbhfiles/storage/files/long_num/file.pdf would become /cbhfiles/storage/files/long_num/
+**NOTE:** Ensure that files all match paths, otherwise remove a level until they match.
 Also ensure that domain stays the same (I've seen some sites use AWS buckets for one file and an on-site storage method for another)
-Verify on page that the href to the file contains the domain, if it doesn't, add the domain to domain.
+Verify* on page that the href to the file contains the domain, if it doesn't, add the domain to domain.
 Step 3: If the domain is not in the href, set domain_included to False, otherwise set it to True
 Step 4: If you set domain_included to False, you need to add the domain (from the http(s) to the top level domain (TLD) (.com, .edu, etc),
 otherwise, you can leave it blank.
 Step 5: Set sleep_time to the desired integer. Best practice is to set it to the crawl-delay in a website's `robots.txt`.
-Most departments do not seem to have a crawl-delay specified, so leave it at 5.
+Most departments do not seem to have a crawl-delay specified, so leave it at 5 (If it's not there).
 Step 6: (Only applies to list_pdf_v3) If there are any documents that you *don't* want to scrape from the page,
 put the words that are **unique** to them.
 Step 7: "debug" will make the scraper more verbose, but will generally be unhelpful to the average user. Leave False unless you're having issues.
 "csv_dir" is better explained in the readme.

+\* Verify this using your browser's developer pane using select element AKA Node Select
+
 EXAMPLE CONFIG:
 configs = {
 "webpage": "http://www.beverlyhills.org/departments/policedepartment/crimeinformation/crimestatistics/web.jsp",

@@ -35,6 +39,7 @@
 }
 """

+
 configs = {
 "webpage": "",
 "web_path": "",
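For illustration, a fully filled-in configs for list_pdf_v2 might look like the sketch below, using the Beverly Hills example from the docstring above. Every key besides `webpage` and `web_path` is inferred from the setup steps, and all values are hypothetical:

```python
# Illustrative values only: key names beyond "webpage"/"web_path" are
# inferred from the setup steps above, not copied from the real template.
configs = {
    "webpage": "http://www.beverlyhills.org/departments/policedepartment/crimeinformation/crimestatistics/web.jsp",
    "web_path": "/cbhfiles/storage/files/long_num/",  # common path shared by the file links
    "domain_included": False,                 # the hrefs hold bare paths, not full URLs...
    "domain": "http://www.beverlyhills.org",  # ...so the domain must be supplied (Step 4)
    "sleep_time": 5,        # robots.txt crawl-delay, or 5 if none is specified
    "debug": False,         # leave False unless you're having issues
    "csv_dir": "",          # better explained in the readme
}
```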
8 changes: 5 additions & 3 deletions Base_Scripts/Scrapers/list_pdf_extractors/list_pdf_v3.py
@@ -9,14 +9,16 @@
 """
 SETUP HOW-TO:
 Step 1: Set webpage to the page you want to scrape.
-Step 2: Click the links that lead to the files, and copy their paths. **NOTE:** Ensure that files all match paths, otherwise remove a level until they match.
+Step 2: Click the links that lead to the files, and copy their paths.
+For example, http://www.beverlyhills.org/cbhfiles/storage/files/long_num/file.pdf would become /cbhfiles/storage/files/long_num/
+**NOTE:** Ensure that files all match paths, otherwise remove a level until they match.
 Also ensure that domain stays the same (I've seen some sites use AWS buckets for one file and an on-site storage method for another)
-Verify on page that the href to the file contains the domain, if it doesn't, add the domain to domain.
+Verify* on page that the href to the file contains the domain, if it doesn't, add the domain to domain.
 Step 3: If the domain is not in the href, set domain_included to False, otherwise set it to True
 Step 4: If you set domain_included to False, you need to add the domain (from the http(s) to the top level domain (TLD) (.com, .edu, etc),
 otherwise, you can leave it blank.
 Step 5: Set sleep_time to the desired integer. Best practice is to set it to the crawl-delay in a website's `robots.txt`.
-Most departments do not seem to have a crawl-delay specified, so leave it at 5.
+Most departments do not seem to have a crawl-delay specified, so leave it at 5 (If it's not there).
 Step 6: (Only applies to list_pdf_v3) If there are any documents that you *don't* want to scrape from the page,
 put the words that are **unique** to them.
 Step 7: "debug" will make the scraper more verbose, but will generally be unhelpful to the average user. Leave False unless you're having issues.
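For the v3-only filtering from Step 6, a hedged sketch follows; the `non_important` key name appears in this PR's commits and readme, but the list format and keyword values are assumptions:

```python
# v3 only: skip documents whose links contain these (unique) keywords.
# The key name "non_important" comes from this PR; the list shape is assumed.
configs = {
    "webpage": "",
    "web_path": "",
    "domain_included": True,
    "domain": "",
    "sleep_time": 5,
    "non_important": ["agenda", "minutes"],  # words unique to unwanted documents
    "debug": False,
    "csv_dir": "",
}
```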
@@ -0,0 +1,13 @@
# Setup

Within `configs.py`:
1. Set `url` to the page's url
1. Set `department_code` to the first few letters of the url, all capitalized. For example, the `department_code` of `https://hsupd.crimegraphics.com/2013/default.aspx` would be `HSUPD`.
1. `list_header` shouldn't need any changing, as it just translates the columns into our `Fields`

# Module

The `crimegraphics_scraper` module requires two arguments: `configs` and `save_dir`. Should you want performance stats, add `stats=True` as an argument.
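A minimal usage sketch; only the signature `crimegraphics_scraper(configs, save_dir, stats=True)` comes from this readme, while the import paths and `save_dir` value are assumptions:

```python
# Sketch only: the import paths and save_dir value are assumptions.
from crimegraphics_scraper import crimegraphics_scraper
from configs import configs  # the configs.py edited above

save_dir = "./data/"  # wherever the scraped tables should be written
crimegraphics_scraper(configs, save_dir, stats=True)  # stats=True is optional
```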

# Info
The scripts should likely be run daily. They will only save the data if the hash (generated from the table) differs from the last run's; otherwise, they simply exit.
76 changes: 76 additions & 0 deletions common/base_scrapers/list_pdf_scrapers/list_pdf_readme.md
@@ -0,0 +1,76 @@
# Setup

1. Clone the repo, either via the command line (`git clone https://github.com/Police-Data-Accessibility-Project/Scrapers.git`) or from the website.
2. `cd` into the `Scrapers` folder and run `pip3 install -r requirements.txt`
3. Copy the extractor version you need, along with `configs.py`, into the `COUNTRY/STATE/COUNTY` folder that you created for the precinct. For example, Alameda County, California, would go in `Scrapers/USA/CA/alameda/`.

This **MUST** be placed within the `Scrapers` folder that you downloaded. See [here](https://github.com/Police-Data-Accessibility-Project/Scrapers/tree/master/USA/CA/alameda) for the example.

Open the `configs.py` file that you copied:
1. Set `webpage` to the page with the pdf lists

2. Open a few PDFs, note the file path they have in common, and set that as `web_path`

3. Set the `domain` to the beginning of the document host.

4. On the page you want to scrape, open your browser's inspector and, using "Select an element", click the link to the PDF (once), then look at the element pane.

If the `href` tag looks like the following (without the domain, just a path), add the common portion of the path. In this case, it's `/Portals/24/Booking Log/`. (Spaces *should* be properly dealt with in the script, but if not, just replace them with `%20`.)


![image](https://user-images.githubusercontent.com/40151222/113303191-d5093200-92ce-11eb-8e42-0c23f70d9f47.png)

Also, if the `href` tag does not have a slash in front of it, as in the following picture, please add one.

![image](https://user-images.githubusercontent.com/40151222/113487408-ffe9b680-9485-11eb-8942-b08fa7c1e528.png)

Make sure to add a slash to the end of the `domain`.
For example, `domain = "https://www.website.com"` would become `domain = "https://www.website.com/"`


If the site sets a crawl-delay in its `robots.txt`, set `sleep_time` to that value. Otherwise, just leave it at `5`.
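A quick, standard-library way to check for a crawl-delay (a sketch; `www.website.com` is a placeholder):

```python
# Check robots.txt for a Crawl-delay before picking sleep_time.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.website.com/robots.txt")
rp.read()
delay = rp.crawl_delay("*")              # None when no Crawl-delay is specified
sleep_time = int(delay) if delay else 5  # fall back to the default of 5
```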

If this does not make sense, try checking the comments within the code (if you can find any).
Working example can be found [here](https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/blob/master/USA/CA/fresno_county/college/fresno/fresno_daily_scraper.py)

# Versions:
`list_pdf_extractor.py` : the most basic of the scripts, mostly used for reference

`list_pdf_extractor_v2.py` : Uses imported `get_files` function. Useful for cases where a custom `get_files` is **not** needed. Function can be found [here](https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/blob/master/common/utils/list_pdf_utils/get_files.py)

`list_pdf_extractor_v3.py` : Built off of v2; allows filtering links by common unwanted words. ~~See [golden_west_scraper.py](https://github.com/CaptainStabs/Scrapers/blob/master/USA/CA/golden_west_college/golden_west_scraper.py) for a working example.~~


This script has two functions. The first, `extract_info`, extracts the links containing documents and saves each URL and document name to a file called `links.txt`.
The second, `get_files`, reads the links and names from `links.txt` and downloads the files.

#### Arguments:

As the `list_pdf_scrapers` all use a common module, they accept the same arguments (a usage sketch follows the V3-specific arguments below).

* `configs` : Required - comes with the template script, so no need to worry about it.
* `save_dir` : Required - comes with the template script, so no need to worry about it.

* `flavor` : Optional - Defaults to `stream`; only used when `extract_tables` is True. Accepted values are `stream` and `lattice`. Useful if the extracted data is jumbled (may not fix everything though).
* `extract_tables` : Optional - Defaults to False; if set to True, will attempt to extract tables from pdfs using [Camelot](https://camelot-py.readthedocs.io/en/master/).

The following 5 arguments are all passed to the `get_files` module. Its readme is located [here](https://github.com/Police-Data-Accessibility-Project/PDAP-Scrapers/blob/master/common/utils/list_pdf_utils/get_files_README.md)

* `name_in_url` : Optional - Defaults to True; as the name implies, set this to False if the document name is **NOT** in the url/path or in the `href`

* `extract_name` : Optional - Defaults to False; A different method of getting the document's name. Use if setting `name_in_url` to False did not work.

Example of when to use: the `url_name.txt` file simply has something like `documentID?5311351`. Only works if the `href` contains a string, like `<a href="/DocumentID?">Annual Report 2020</a>`. (This is, of course, a highly simplified tag; there will be a lot more clutter on actual websites.)
* `add_date` : Optional - Defaults to False; use if a document is simply overwritten on a website without its name being changed. Used in conjunction with `no_overwrite`
* `try_overwite` : Optional - Defaults to False; mostly deprecated, check with a Director before using. Use `no_overwrite` instead
* `no_overwrite` : Optional - Defaults to False; Replaces `try_overwite`. Use in conjunction with `add_date`. As the name suggests, it prevents older documents from being overwritten, while still saving the new one if there are changes.

##### Arguments unique to V3

* `delete` : Optional - Defaults to True; if set to False, `url_name.txt` will not be deleted. This argument is also passed to `get_files`, since `get_files` deletes the file once it is done with it.
* `important` : Optional - Defaults to False; if there are more files that you *don't* want than ones you do want, set this to True to filter the other way and keep only items containing the keywords. If set to True, rename `non_important` in the configs to `important` (it will work either way, but not without complaining if it can't find `important`)
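As referenced above, a hedged invocation sketch: the entry-point name and import path are assumptions (the template script normally wires up `configs` and `save_dir` for you), but the keyword arguments are the ones documented in this readme:

```python
# Hypothetical invocation: the function name and import path are assumed;
# only the keyword arguments below are documented in this readme.
from list_pdf_v3 import list_pdf_v3  # assumed import path and name
from configs import configs

save_dir = "./data/"  # assumed location
list_pdf_v3(
    configs,
    save_dir,
    extract_tables=True,  # extract tables from the PDFs with Camelot
    flavor="lattice",     # try this if the default "stream" output is jumbled
    add_date=True,        # date-stamp documents that get overwritten upstream
    no_overwrite=True,    # keep older copies instead of clobbering them
    delete=False,         # v3 only: keep url_name.txt for inspection
)
```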


# More in-depth explanations (Poorly explained, nerdy stuff)
`extract_info` uses `urllib` to open the webpage, and then `BeautifulSoup4` to parse it. It then uses regex to find all links that end with pdf or doc. A few lines still need to be replaced with regex.
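A simplified sketch of that flow (the real script's parsing details and `links.txt` format may differ):

```python
# urllib fetches the page, BeautifulSoup4 parses it, regex keeps pdf/doc links.
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

def extract_info(webpage, domain=""):
    soup = BeautifulSoup(urlopen(webpage).read(), "html.parser")
    with open("links.txt", "w") as f:
        for a in soup.find_all("a", href=True):
            href = a["href"]
            if re.search(r"\.(pdf|docx?)$", href, re.IGNORECASE):
                # Prepend the domain when the href is a bare path (Step 3/4).
                url = href if href.startswith("http") else domain + href
                f.write(f"{url}, {url.rsplit('/', 1)[-1]}\n")
```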
52 changes: 52 additions & 0 deletions common/gui/error_modal.py
@@ -0,0 +1,52 @@
# -*- coding: utf-8 -*-

# Form implementation generated from reading ui file './common/gui/error_modal.ui'
#
# Created by: PyQt5 UI code generator 5.15.4
#
# WARNING: Any manual changes made to this file will be lost when pyuic5 is
# run again. Do not edit this file unless you know what you are doing.


from PyQt5 import QtCore, QtGui, QtWidgets


class Ui_error_dialog(object):
def setupUi(self, error_dialog):
error_dialog.setObjectName("error_dialog")
error_dialog.setWindowModality(QtCore.Qt.ApplicationModal)
error_dialog.resize(400, 300)
error_dialog.setModal(True)
self.buttonBox = QtWidgets.QDialogButtonBox(error_dialog)
self.buttonBox.setGeometry(QtCore.QRect(30, 240, 341, 32))
self.buttonBox.setOrientation(QtCore.Qt.Horizontal)
self.buttonBox.setStandardButtons(QtWidgets.QDialogButtonBox.Cancel|QtWidgets.QDialogButtonBox.Ok)
self.buttonBox.setObjectName("buttonBox")
self.label = QtWidgets.QLabel(error_dialog)
self.label.setGeometry(QtCore.QRect(70, 90, 271, 51))
font = QtGui.QFont()
font.setPointSize(13)
self.label.setFont(font)
self.label.setAutoFillBackground(True)
self.label.setScaledContents(True)
self.label.setWordWrap(True)
self.label.setObjectName("label")
self.label_2 = QtWidgets.QLabel(error_dialog)
self.label_2.setGeometry(QtCore.QRect(30, 40, 221, 61))
font = QtGui.QFont()
font.setPointSize(20)
font.setBold(True)
font.setWeight(75)
self.label_2.setFont(font)
self.label_2.setObjectName("label_2")

self.retranslateUi(error_dialog)
self.buttonBox.accepted.connect(error_dialog.accept)
self.buttonBox.rejected.connect(error_dialog.reject)
QtCore.QMetaObject.connectSlotsByName(error_dialog)

def retranslateUi(self, error_dialog):
_translate = QtCore.QCoreApplication.translate
error_dialog.setWindowTitle(_translate("error_dialog", "Dialog"))
self.label.setText(_translate("error_dialog", "You need to complete the first menu first"))
self.label_2.setText(_translate("error_dialog", "ERROR:"))
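For reference, the usual pattern for showing a pyuic5-generated dialog like this one (a sketch; the import path is assumed):

```python
# Standard usage of a pyuic5-generated dialog class.
import sys
from PyQt5 import QtWidgets
from common.gui.error_modal import Ui_error_dialog  # import path assumed

app = QtWidgets.QApplication(sys.argv)
dialog = QtWidgets.QDialog()
ui = Ui_error_dialog()
ui.setupUi(dialog)
dialog.exec_()  # blocks until the user clicks OK or Cancel
```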
121 changes: 121 additions & 0 deletions common/gui/error_modal.ui
@@ -0,0 +1,121 @@
<?xml version="1.0" encoding="UTF-8"?>
<ui version="4.0">
<class>error_dialog</class>
<widget class="QDialog" name="error_dialog">
<property name="windowModality">
<enum>Qt::ApplicationModal</enum>
</property>
<property name="geometry">
<rect>
<x>0</x>
<y>0</y>
<width>400</width>
<height>300</height>
</rect>
</property>
<property name="windowTitle">
<string>Dialog</string>
</property>
<property name="modal">
<bool>true</bool>
</property>
<widget class="QDialogButtonBox" name="buttonBox">
<property name="geometry">
<rect>
<x>30</x>
<y>240</y>
<width>341</width>
<height>32</height>
</rect>
</property>
<property name="orientation">
<enum>Qt::Horizontal</enum>
</property>
<property name="standardButtons">
<set>QDialogButtonBox::Cancel|QDialogButtonBox::Ok</set>
</property>
</widget>
<widget class="QLabel" name="label">
<property name="geometry">
<rect>
<x>70</x>
<y>90</y>
<width>271</width>
<height>51</height>
</rect>
</property>
<property name="font">
<font>
<pointsize>13</pointsize>
</font>
</property>
<property name="autoFillBackground">
<bool>true</bool>
</property>
<property name="text">
<string>You need to complete the first menu first</string>
</property>
<property name="scaledContents">
<bool>true</bool>
</property>
<property name="wordWrap">
<bool>true</bool>
</property>
</widget>
<widget class="QLabel" name="label_2">
<property name="geometry">
<rect>
<x>30</x>
<y>40</y>
<width>221</width>
<height>61</height>
</rect>
</property>
<property name="font">
<font>
<pointsize>20</pointsize>
<weight>75</weight>
<bold>true</bold>
</font>
</property>
<property name="text">
<string>ERROR:</string>
</property>
</widget>
</widget>
<resources/>
<connections>
<connection>
<sender>buttonBox</sender>
<signal>accepted()</signal>
<receiver>error_dialog</receiver>
<slot>accept()</slot>
<hints>
<hint type="sourcelabel">
<x>248</x>
<y>254</y>
</hint>
<hint type="destinationlabel">
<x>157</x>
<y>274</y>
</hint>
</hints>
</connection>
<connection>
<sender>buttonBox</sender>
<signal>rejected()</signal>
<receiver>error_dialog</receiver>
<slot>reject()</slot>
<hints>
<hint type="sourcelabel">
<x>316</x>
<y>260</y>
</hint>
<hint type="destinationlabel">
<x>286</x>
<y>274</y>
</hint>
</hints>
</connection>
</connections>
</ui>