diff --git a/README.md b/README.md
index a30ac55..d95d573 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,10 @@
### simeon
-`simeon` is a CLI tool to help with the processing of edx Research data. It can `list`, `download`, and `split` edX data packages. It can also `push` the output of the `split` subcommand to both GCS and BigQuery. It is heavily inspired by the [edx2bigquery
-](https://github.com/mitodl/edx2bigquery) package. If you've used that tool, you should be able to navigate the quirks that may come with this one.
+`simeon` is a CLI tool to help with the processing of edX Research data.
+It can `list`, `download`, and `split` edX data packages. It can also `push` the output of the `split` subcommand to both GCS and BigQuery.
+It is heavily inspired by the [edx2bigquery](https://github.com/mitodl/edx2bigquery) package. If you've used that tool, you should be able to navigate the quirks that may come with this one.
-### Installing with pip
+### Installing from PyPI
```sh
python3 -m pip install simeon
# Or with geoip
@@ -12,19 +13,19 @@ python3 -m pip install simeon[geoip]
simeon --help
```
-### Installing with git clone
+### Installing with git clone and pip
```sh
git clone git@github.com:MIT-IR/simeon.git
-cd simeon && python -m pip install .
+cd simeon && python3 -m venv venv && source venv/bin/activate && python -m pip install .
# Or with geoip
-cd simeon && python -m pip install .[geoip]
+cd simeon && python3 -m venv venv && source venv/bin/activate && python -m pip install .[geoip]
# Then invoke the CLI tool with
simeon --help
```
### Using Docker
```sh
-docker run -it mitir/simeon:latest
+docker run --rm -it mitir/simeon:latest
simeon --help
```
@@ -34,7 +35,7 @@ git clone git@github.com:MIT-IR/simeon.git
cd simeon
# Set up a virtual environment if you don't already have on
python3 -m venv venv
-. venv/bin/activate
+source venv/bin/activate
# pip install the package in an editable way
python3 -m pip install -e .[test,geoip]
# Invoke the executable
@@ -46,7 +47,7 @@ tox
### Setups and configurations
-`simeon` is a glorified downloader and uploader set of scripts. Much of the downloading and uploading that it does makes the assumptions that you have your AWS credentials configured properly and that you've got a service account file for GCP services available on your machine. If the latter is missing, you may have to authenticate to GCP services through the SDK. However, both we and Google recommend you not do that.
+`simeon` is a glorified downloader and uploader set of scripts. Much of the downloading and uploading that it does assumes that you have your AWS credentials configured properly and that you've got a service account file for GCP services available on your machine. If the latter is missing, you may have to authenticate to GCP services through the SDK. However, both we and Google recommend you not do that.
Every downloaded file is decrypted either during the download process or while it gets split by the `simeon split` command. So, this tool assumes that you've installed and configured `gpg` to be able to decrypt files from edX.
@@ -57,7 +58,7 @@ The following steps may be useful to someone just getting started with the edX d
- Configure both AWS and gpg, so your credentials can access the S3 buckets and your `gpg` key can decrypt the files there
2. Setup a GCP project
- Create a GCP project
- - Setup a BigQuery workspace
+ - Set up a BigQuery workspace
- Create a GCS bucket
- Create a service account and download the associated file
- Give the service account Admin Role access to both the BigQuery project and the GCS bucket
@@ -105,7 +106,7 @@ The options in the config file(s) should match the optional arguments of the CLI
```sh
# List the latest SQL bundle
simeon list -s edx -o mitx -f sql -L
- # List the laetst email data dump
+ # List the latest email data dump
simeon list -s edx -o mitx -f email -L
# List the latest tracking log file
simeon list -s edx -o mitx -f log -L
@@ -189,21 +190,15 @@ The options in the config file(s) should match the optional arguments of the CLI
1. Please note that SQL bundles are quite large when split up, so consider using the `-c` or `--courses` option when invoking `simeon download -S` or `simeon split` to make sure that you limit the splitting to a set of course IDs. You may also use the `--clistings-file` option, which expects a txt file of course IDs; one ID per line.
If the aforementioned options are not used, `simeon` may end up failing to complete the split operation due to exhausted system resources (storage to be specific).
-
2. `simeon download` with file types `log` and `email` will both download and decrypt the files matching the given criteria. If the latter operations are successful, then the encrypted files are deleted by default. This is to make sure that you don't exhaust storage resources. If you wish to keep those files, you can always use the `--keep-encrypted` option that comes with `simeon download` and `simeon split`.
SQL bundles are only downloaded (not decrypted). Their decryption is done during a `split` operation.
-
3. Unless there is an unhandled exception (which should be reported as a bug), `simeon` should, by default, print to the standard output both information and errors encountered while processing your files. You can capture those logs in a file by using the global option `--log-file` and providing a destination file for the logs.
-
4. When using multi argument options like `--tables` or `--courses`, you should try not to place them right before the expected positional arguments. This will help the CLI parser not confuse your positional arguments with table names (in the case of `--tables`) or course IDs (when `--courses` is used).
-
5. Splitting tracking logs is a resource intensive process. The routine that splits the logs generates a file for each course ID encountered. If you happen to have more course IDs in your logs than the running process can open operating system file descriptors, then `simeon` will put away records it can't save to disk for a second pass. Putting away the records involves using more memory than normally required. The second pass will only require one file descriptor at a time, so it should be safe in terms of file descriptor limits. To help `simeon` not have to do a second pass, you may increase the file descriptor limits of processes from your shell by running something like `ulimit -n 2000` before calling `simeon split` on Unix machines. For Windows users, you may have to dig into the Windows Registries for a corresponding setting. This should tell your OS kernel to allow OS processes to open up to 2000 file handles.
-
6. Care must be taken when using `simeon split` and `simeon push` to make sure that the number of positional arguments passed does not lead to the invoked command exceeding the maximum command-line length allowed for arguments in a command. To avoid errors along those lines, please consider passing the positional arguments as UNIX glob patterns. For instance, `simeon split --file-type log 'data/TRACKING-LOGS/*/*.log.gz'` tells `simeon` to expand the given glob pattern, instead of relying on the shell to do it.
-
-7. The `report` subcommand relies on the presence of SQL query files to parse and send to BigQuery to execute. Any errors arising from executing the parsed queries will be shown to the end user through the given log stream. While the `simeon` tool ships with query files for most secondary/reporting tables that are based on the `edx2bigquery` tool, an end user should be able to point `simeon` to a different location with SQL query files by using the `--query-dir` option that comes with `simeon report`. Additionally, these query files can contain [`jinja2 templated`](https://jinja.palletsprojects.com/en/latest/) SQL code. Any mentioned variables within these templated queries can be passed to `simeon report` by using the `--extra-args` option and passing key-value pair items in the format `var1=value1,var2=value2,var3=value3,...,varn=valuen`. Further, these key-value pair items can also be typed by using the format `var1:i=value1,var2:s=value2,var3:f=value3,...,varn:s=valuen`. In this format, the type is append to the key, separated by a colon. The only supported scalar types, so far, are `s` for `str`, `i` for `int`, and `f` for `float`. If any conversion errors occur during value parsing, then those are shown to the end user, and the query won't get executed. Finally, if you wish to pass an `array` or `list` to the template, you will need to repeat a key multiple times. For instance, if you want to pass a list named `mylist` containing the integers, you could write something like `--extra-args mylist:i=1,mylist:i=2,mylist:i=3`. This means that you'll have a python `list` named `mylist` within your template, and it should contain `[1, 2, 3]`.
+7. The `report` subcommand relies on the presence of SQL query files to parse and send to BigQuery to execute. Any errors arising from executing the parsed queries will be shown to the end user through the given log stream. While the `simeon` tool ships with query files for most secondary/reporting tables that are based on the `edx2bigquery` tool, an end user should be able to point `simeon` to a different location with SQL query files by using the `--query-dir` option that comes with `simeon report`. Additionally, these query files can contain [jinja2 templated](https://jinja.palletsprojects.com/en/latest/) SQL code. Any mentioned variables within these templated queries can be passed to `simeon report` by using the `--extra-args` option and passing key-value pair items in the format `var1=value1,var2=value2,var3=value3,...,var_n=value_n`. Further, these key-value pair items can also be typed by using the format `var1:i=value1,var2:s=value2,var3:f=value3,...,var_n:s=value_n`. In this format, the type is appended to the key, separated by a colon. The only supported scalar types, so far, are `s` for `str`, `i` for `int`, and `f` for `float`. If any conversion errors occur during value parsing, then those are shown to the end user, and the query won't get executed. Finally, if you wish to pass an `array` or `list` to the template, you will need to repeat a key multiple times. For instance, if you want to pass a list named `mylist` containing the integers 1, 2, and 3, you could write something like `--extra-args mylist:i=1,mylist:i=2,mylist:i=3`. This means that you'll have a Python `list` named `mylist` within your template, and it should contain `[1, 2, 3]`. You can also pass a JSON file, prefixed with a leading `@`, whose top-level objects are parsed as variables.
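
To make the typed key-value format from note 7 concrete, here is a minimal sketch of how such a string could be parsed. It is not `simeon`'s actual implementation: the function name `parse_extra_args` and the `CONVERTERS` table are hypothetical, and treating untyped keys as `str` is an assumption. Values that themselves contain commas or `=` signs are not handled.

```python
from typing import Any, Dict

# Hypothetical lookup table for the type suffixes named in the README
CONVERTERS = {"s": str, "i": int, "f": float}


def parse_extra_args(text: str) -> Dict[str, Any]:
    """Parse 'k1:i=1,k2=foo' into a dict; repeated keys become lists."""
    parsed: Dict[str, Any] = {}
    for item in text.split(","):
        if not item:
            continue  # tolerate stray or trailing commas
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        name, _, type_code = key.partition(":")
        # Assumption: a key without a type suffix is treated as a string
        convert = CONVERTERS.get(type_code or "s")
        if convert is None:
            raise ValueError(f"unsupported type {type_code!r} in {item!r}")
        value = convert(raw)  # int() and float() raise ValueError on bad input
        if name not in parsed:
            parsed[name] = value
        elif isinstance(parsed[name], list):
            parsed[name].append(value)
        else:
            # Second occurrence of a key: promote the scalar to a list
            parsed[name] = [parsed[name], value]
    return parsed
```

With this sketch, `parse_extra_args("mylist:i=1,mylist:i=2,mylist:i=3")` yields a dictionary whose `mylist` entry is the list `[1, 2, 3]`, and a value such as `n:i=abc` raises a `ValueError` before any query would run.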
diff --git a/README.rst b/README.rst
index f5398f2..79134aa 100644
--- a/README.rst
+++ b/README.rst
@@ -9,8 +9,8 @@ BigQuery. It is heavily inspired by the
you’ve used that tool, you should be able to navigate the quirks that
may come with this one.
-Installing with pip
-~~~~~~~~~~~~~~~~~~~
+Installing from PyPI
+~~~~~~~~~~~~~~~~~~~~
.. code:: sh
@@ -20,15 +20,15 @@ Installing with pip
# Then invoke the CLI tool with
simeon --help
-Installing with git clone
-~~~~~~~~~~~~~~~~~~~~~~~~~
+Installing with git clone and pip
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: sh
git clone git@github.com:MIT-IR/simeon.git
- cd simeon && python -m pip install .
+ cd simeon && python3 -m venv venv && source venv/bin/activate && python -m pip install .
# Or with geoip
- cd simeon && python -m pip install .[geoip]
+ cd simeon && python3 -m venv venv && source venv/bin/activate && python -m pip install .[geoip]
# Then invoke the CLI tool with
simeon --help
@@ -37,7 +37,7 @@ Using Docker
.. code:: sh
- docker run -it mitir/simeon:latest
+ docker run --rm -it mitir/simeon:latest
simeon --help
Developing
@@ -49,7 +49,7 @@ Developing
cd simeon
# Set up a virtual environment if you don't already have on
python3 -m venv venv
- . venv/bin/activate
+ source venv/bin/activate
# pip install the package in an editable way
python3 -m pip install -e .[test,geoip]
# Invoke the executable
@@ -85,7 +85,7 @@ the edX data package:
2. Setup a GCP project
- Create a GCP project
- - Setup a BigQuery workspace
+ - Set up a BigQuery workspace
- Create a GCS bucket
- Create a service account and download the associated file
- Give the service account Admin Role access to both the BigQuery
@@ -154,7 +154,7 @@ and end dates), and site (``edx`` or ``edge`` or ``patches``).
# List the latest SQL bundle
simeon list -s edx -o mitx -f sql -L
- # List the laetst email data dump
+ # List the latest email data dump
simeon list -s edx -o mitx -f email -L
# List the latest tracking log file
simeon list -s edx -o mitx -f log -L
@@ -306,11 +306,11 @@ Notes:
``edx2bigquery`` tool, an end user should be able to point ``simeon``
to a different location with SQL query files by using the
``--query-dir`` option that comes with ``simeon report``.
- Additionally, these query files can contain
- ```jinja2 templated`` `__
- SQL code. Any mentioned variables within these templated queries can
- be passed to ``simeon report`` by using the ``--extra-args`` option
- and passing key-value pair items in the format
+ Additionally, these query files can contain `jinja2
+ templated <https://jinja.palletsprojects.com/en/latest/>`__ SQL code.
+ Any mentioned variables within these templated queries can be passed
+ to ``simeon report`` by using the ``--extra-args`` option and passing
+ key-value pair items in the format
``var1=value1,var2=value2,var3=value3,...,varn=valuen``. Further,
these key-value pair items can also be typed by using the format
``var1:i=value1,var2:s=value2,var3:f=value3,...,varn:s=valuen``. In
diff --git a/pyproject.toml b/pyproject.toml
new file mode 100644
index 0000000..16014ad
--- /dev/null
+++ b/pyproject.toml
@@ -0,0 +1,58 @@
+[build-system]
+requires = ["setuptools>=59.6.0", "setuptools-scm"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "simeon"
+authors = [
+ {name = "MIT Institutional Research", email = "irx@mit.edu"},
+]
+description = "A CLI tool to help process research data from edX"
+readme = "README.rst"
+requires-python = ">=3.6"
+keywords = ["research", "edx", "MOOC", "education", "online-learning"]
+license = {text = "MIT License"}
+classifiers = [
+ "Development Status :: 4 - Beta",
+ "Environment :: Console",
+ "Intended Audience :: Science/Research",
+ "Natural Language :: English",
+ "License :: OSI Approved :: MIT License",
+ "Programming Language :: Python",
+ "Programming Language :: Python :: 3",
+ "Programming Language :: Python :: 3.6",
+ "Programming Language :: Python :: 3.7",
+ "Programming Language :: Python :: 3.8",
+ "Programming Language :: Python :: 3.9",
+ "Programming Language :: Python :: 3.10",
+ "Programming Language :: Python :: 3.11",
+ "Programming Language :: Python :: Implementation :: CPython",
+ "Topic :: Text Processing",
+]
+dependencies = [
+ "boto3>=1.16.57",
+ "google-cloud-bigquery>=2.6.2",
+ "google-cloud-storage>=1.35.0",
+ "jinja2",
+ "python-dateutil>=2.8.1",
+]
+dynamic = ["version"]
+
+[project.optional-dependencies]
+geoip = ["geoip2"]
+test = ["black", "isort", "pip-tools", "sphinx", "tox"]
+dev = ["black", "isort", "pip-tools", "sphinx", "tox"]
+
+[tool.setuptools.package-data]
+"simeon.upload" = ["schemas/*.json"]
+"simeon.report" = ["queries/*.sql"]
+"simeon.scripts" = ["data/*.csv"]
+
+[project.scripts]
+simeon = "simeon.scripts.simeon:main"
+simeon-geoip = "simeon.scripts.geoip:main"
+simeon-youtube = "simeon.scripts.youtube:main"
+
+[project.urls]
+"Homepage" = "https://github.com/MIT-IR/simeon"
+"Bug Tracker" = "https://github.com/MIT-IR/simeon/issues"
diff --git a/requirements_dev.txt b/requirements_dev.txt
index 99c7097..8e8d6ee 100644
--- a/requirements_dev.txt
+++ b/requirements_dev.txt
@@ -1,2 +1,4 @@
+black
+isort
Sphinx>=3.4.3
tox>=3.21.2
diff --git a/setup.py b/setup.py
index 7ac4f78..87573f8 100644
--- a/setup.py
+++ b/setup.py
@@ -1,62 +1,69 @@
import os
+
from setuptools import find_packages, setup
# Get the value of __version__ from the library's __init__.py file
-exec(open(os.path.join('simeon', '__init__.py')).read())
+exec(open(os.path.join("simeon", "__init__.py")).read())
setup(
- name='simeon',
- version=globals().get('__version__', '0.0.24'),
- author='MIT Institutional Research',
- author_email='irx@mit.edu',
- packages=find_packages(exclude=('docs',)),
- url='https://github.com/MIT-IR/simeon',
- license='MIT LICENSE',
+ name="simeon",
+ version=globals().get("__version__", "0.0.25"),
+ author="MIT Institutional Research",
+ author_email="irx@mit.edu",
+ packages=find_packages(exclude=("docs",)),
+ url="https://github.com/MIT-IR/simeon",
+ license="MIT LICENSE",
keywords=[
- 'edx research data', 'mitx', 'edx',
- 'MOOC', 'education', 'online learning'
+ "edx research data",
+ "mitx",
+ "edx",
+ "MOOC",
+ "education",
+ "online learning",
],
- python_requires='>=3.6',
- description='A CLI tool to help process research data from edX',
- long_description=open('README.rst').read(),
- # include_package_data=True,
+ python_requires=">=3.6",
+ description="A CLI tool to help process research data from edX",
+ long_description=open("README.rst").read(),
entry_points={
- 'console_scripts': [
- 'simeon=simeon.scripts.simeon:main',
- 'simeon-geoip=simeon.scripts.geoip:main',
- 'simeon-youtube=simeon.scripts.youtube:main',
+ "console_scripts": [
+ "simeon=simeon.scripts.simeon:main",
+ "simeon-geoip=simeon.scripts.geoip:main",
+ "simeon-youtube=simeon.scripts.youtube:main",
],
},
install_requires=[
- 'boto3>=1.16.57',
- 'google-cloud-bigquery>=2.6.2',
- 'google-cloud-storage>=1.35.0',
- 'jinja2',
- 'python-dateutil>=2.8.1',
+ "boto3>=1.16.57",
+ "google-cloud-bigquery>=2.6.2",
+ "google-cloud-storage>=1.35.0",
+ "jinja2",
+ "python-dateutil>=2.8.1",
],
extras_require={
- 'geoip': ['geoip2'],
- 'test': ['sphinx', 'tox'],
+ "geoip": ["geoip2"],
+ "test": ["black", "isort", "pip-tools", "sphinx", "tox"],
+ "dev": ["black", "isort", "pip-tools", "sphinx", "tox"],
},
package_data={
- 'simeon.upload': ['schemas/*.json'],
- 'simeon.report': ['queries/*.sql'],
- 'simeon.scripts': ['data/*.csv'],
+ "simeon.upload": ["schemas/*.json"],
+ "simeon.report": ["queries/*.sql"],
+ "simeon.scripts": ["data/*.csv"],
},
test_suite="simeon.tests",
classifiers=[
- 'Development Status :: 4 - Beta',
- 'Environment :: Console',
- 'Intended Audience :: Science/Research',
- 'Natural Language :: English',
- 'License :: OSI Approved :: MIT License',
- 'Programming Language :: Python',
- 'Programming Language :: Python :: 3',
- 'Programming Language :: Python :: 3.6',
- 'Programming Language :: Python :: 3.7',
- 'Programming Language :: Python :: 3.8',
- 'Programming Language :: Python :: 3.9',
- 'Programming Language :: Python :: Implementation :: CPython',
- 'Topic :: Text Processing',
+ "Development Status :: 4 - Beta",
+ "Environment :: Console",
+ "Intended Audience :: Science/Research",
+ "Natural Language :: English",
+ "License :: OSI Approved :: MIT License",
+ "Programming Language :: Python",
+ "Programming Language :: Python :: 3",
+ "Programming Language :: Python :: 3.6",
+ "Programming Language :: Python :: 3.7",
+ "Programming Language :: Python :: 3.8",
+ "Programming Language :: Python :: 3.9",
+ "Programming Language :: Python :: 3.10",
+ "Programming Language :: Python :: 3.11",
+ "Programming Language :: Python :: Implementation :: CPython",
+ "Topic :: Text Processing",
],
)