
Feat/19 improve readability of the table format #42

Merged
merged 6 commits into dev from feat/19-improve-readability-of-the-table-format
Jan 30, 2024

Conversation

zosiaboro
Collaborator

Added a nice_data attribute to Table to display a reformatted data table

Resolves #19

@zosiaboro
Collaborator Author

I did not add a parameter like raw=False to choose between the data tables as we discussed, but instead added a new attribute. This is in line with how we name raw_data and metadata, although it might be good if the readable table were the default.
I couldn't think of a neat way of adding such a parameter, but let me know if you think of another way of doing this :)


@pmayd pmayd left a comment


Thanks for the PR and the work on this issue!

Sorry that I had so many comments, but there is a lot of room for improvement: better naming as well as more performant code.

If you have any questions, feel free to ask me anytime

@@ -41,9 +41,40 @@ def get_data(self, area: str = "all", **kwargs):
self.raw_data = raw_data
data_str = StringIO(raw_data)
self.data = pd.read_csv(data_str, sep=";")
self.nice_data = format_table(self.data)

You should not introduce a new class instance variable inside a method; all instance variables should already be known and set in the __init__ method.
self.data is meant to hold the "pretty" data, because we also have self.raw_data, which contains the raw data from the endpoint. So I think you can use self.data directly and make it more readable. We can control this via a parameter of the get_data method, like prettify=True, which by default makes the data more readable.
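A minimal sketch of this suggestion (the prettify_table helper, the inline CSV string, and the class skeleton are illustrative assumptions, not code from the PR):

```python
from io import StringIO

import pandas as pd


def prettify_table(data: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the reformatting logic; the real version reshapes the table."""
    return data


class Table:
    def __init__(self, name: str):
        self.name = name
        # All instance variables are declared up front in __init__.
        self.raw_data = ""
        self.data = pd.DataFrame()

    def get_data(self, prettify: bool = True, **kwargs):
        raw_data = "Zeit;Wert\n2020;1"  # stand-in for the endpoint response
        self.raw_data = raw_data
        self.data = pd.read_csv(StringIO(raw_data), sep=";")
        if prettify:
            # self.data holds the readable table; the raw string stays in raw_data
            self.data = prettify_table(self.data)
```

This keeps the public surface to just raw_data and data, with the parameter deciding which shape data takes.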


If you introduce too many class instance variables, you make the class harder to use, because users no longer know which attribute to choose: is it raw_data, data, or nice_data, and what are the differences?


metadata = load_data(
endpoint="metadata", method="table", params=params, as_json=True
)
assert isinstance(metadata, dict) # nosec assert_used
self.metadata = metadata

def format_table(data: pd.DataFrame,

It is OK to put the function outside of the class, but you could also have made it a staticmethod inside the class, so it is clearer that this method only works on the table data. From a function defined outside, I would actually expect that it is also used somewhere else.
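As a sketch, the staticmethod variant could look like this (the prettify_table name and the pass-through body are hypothetical):

```python
import pandas as pd


class Table:
    @staticmethod
    def prettify_table(data: pd.DataFrame) -> pd.DataFrame:
        """Reformat the flat CSV table into a more readable one.

        Placed inside the class to signal it only applies to table data.
        """
        return data  # the actual reshaping logic would go here
```

A staticmethod needs no self, so it stays a plain function while living in the class namespace.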


I think we should find a better name for the function, because you are not formatting the table: it is already formatted, as a flat CSV file. What we actually do is reformat the table into a more readable format, so you could consider something like reformat_table_data or prettify_table, which is also very close to my suggested parameter prettify=True.


metadata = load_data(
endpoint="metadata", method="table", params=params, as_json=True
)
assert isinstance(metadata, dict) # nosec assert_used
self.metadata = metadata

def format_table(data: pd.DataFrame,
) -> pd.DataFrame:

You need to run the pre-commit pipeline over this, or at least the black formatter; your style is a little off, like this empty bracket on a new line. We use black to format the code properly and consistently.


def format_table(data: pd.DataFrame,
) -> pd.DataFrame:
"""Format the raw data into a more readable table

Careful: you are not formatting the raw data but the data, i.e. self.data instead of self.raw_data; otherwise you would get a string and not a pandas DataFrame.

"""Format the raw data into a more readable table

Args:
data (pd.DataFrame): A pandas dataframe created with get_data()

Instead of "created", it would be better to say "returned by", to make clear that this is the return value of the mentioned method.

time_name, = data["Zeit_Label"].unique()
time_values = data["Zeit"]

merkmal_labels = data.filter(like="Merkmal_Label").columns

I know that in Germany this is called "Merkmal", but it reads strangely in an English code base, so maybe we could use the more appropriate "attributes" here, or something like it. I also admit that it is strange to have the German filter text as the like argument...

indep_names = [data[name].unique()[0] for name in merkmal_labels] # list of column names from Merkmal_Label

auspraegung_labels = data.filter(like="Auspraegung_Label").columns
indep_values = [data[name] for name in auspraegung_labels] # list of data from Ausgepragung_Label

You are just getting a list of columns here, right? Then it would be easier to use data[auspraegung_labels] directly, or even just data.filter(), which gives you the columns directly.
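For illustration, both forms side by side on a made-up two-row frame (column names and values are invented, modeled on the diff):

```python
import pandas as pd

# Hypothetical excerpt of the flat table returned by the endpoint.
data = pd.DataFrame(
    {
        "Auspraegung_Label_1": ["Berlin", "Hamburg"],
        "Auspraegung_Label_2": ["maennlich", "weiblich"],
        "Wert": [10, 20],
    }
)

# PR version: collect the column names, then index them one by one.
auspraegung_labels = data.filter(like="Auspraegung_Label").columns
as_list = [data[name] for name in auspraegung_labels]

# Suggested version: one filter call returns the columns as a DataFrame.
as_frame = data.filter(like="Auspraegung_Label")
```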


Again, instead of indep_values, I think it would be nicer to have something like merkmal_column_values.

auspraegung_labels = data.filter(like="Auspraegung_Label").columns
indep_values = [data[name] for name in auspraegung_labels] # list of data from Ausgepragung_Label

dep_values = data.loc[:,auspraegung_labels[-1]:].iloc[:,1:] # get all columns after last Auspraegung column

This reads very strangely; maybe you have to explain a little more why it is necessary? In any case, a simple drop would probably work better and read much more easily.


Or we could probably use the filter method again, scanning for the double underscore?
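A sketch of that filter idea, using a made-up frame in which only the value columns carry the double-underscore separator:

```python
import pandas as pd

# Hypothetical excerpt: value columns contain "__", dimension columns do not.
data = pd.DataFrame(
    {
        "Zeit": [2020],
        "Auspraegung_Label": ["Berlin"],
        "BEV036__Bevoelkerung_in_Hauptwohnsitzhaushalten__1000": [3669],
    }
)

# Instead of positional slicing with loc/iloc, select by the "__" marker.
dep_values = data.filter(like="__")
```

This avoids depending on column order entirely.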


dep_values = data.loc[:,auspraegung_labels[-1]:].iloc[:,1:] # get all columns after last Auspraegung column
dep_names = [" ".join(name.split('_')[1:])
for name in dep_values.columns] # splits strings in column names for readability

Your comments are a little too generic. "splits strings in column names for readability" explains the obvious, as the reader can see that you use the split method, but it is not clear WHY you split this way. A better comment would give a concrete example or explain why the first part of the name is omitted.


And I think you are keeping the unit, right? I think we should not do this but store this information somewhere else. So I would actually choose `[1:-1]` to exclude both the code and the unit from the text.

As a comment, something like this would be a good idea, I guess:

# Given a name like BEV036__Bevoelkerung_in_Hauptwohnsitzhaushalten__1000
# we extract the readable label and omit both the code and the unit

Also, I just noticed that you split by _ but the separator is __, so you could simplify your code again with [name.split('__')[1] for name in dep_values.columns], and you don't need join at all.
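The difference between the two splits on the example name from the comment above:

```python
name = "BEV036__Bevoelkerung_in_Hauptwohnsitzhaushalten__1000"

# Splitting on a single underscore also breaks the label itself apart and
# leaves empty strings where the double separators were:
joined = " ".join(name.split("_")[1:])  # stray spaces from the empty parts

# Splitting on the double underscore isolates code, label, and unit cleanly:
code, label, unit = name.split("__")
```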


nice_dict = {time_name:time_values,
**dict(zip(indep_names, indep_values)),
**dict(zip(dep_names, dep_values.values.T))}

It is strange that you have to call .values.T here; we should try to fix this, as it seems very odd and can probably be solved more easily. It is easy to miss and hard to understand while reading. I would have to test the code to see why you did it, but dep_values is a DataFrame, so you could just merge the two DataFrames, for example, or use the DataFrame's to_dict method: first overwrite the column names with dep_names and then export the DataFrame into a dict using to_dict. You need no zip for this.
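A sketch of the rename-then-to_dict approach on a made-up dep_values frame (names and values are illustrative):

```python
import pandas as pd

# Hypothetical value columns as selected earlier.
dep_values = pd.DataFrame(
    {"BEV036__Bevoelkerung_in_Hauptwohnsitzhaushalten__1000": [3669, 1841]}
)
dep_names = [name.split("__")[1] for name in dep_values.columns]

# PR version: transpose the underlying array and zip it with the names.
via_zip = dict(zip(dep_names, dep_values.values.T))

# Suggestion: overwrite the column names, then export with to_dict.
renamed = dep_values.copy()
renamed.columns = dep_names
via_to_dict = renamed.to_dict(orient="list")
```

Both produce a mapping from readable names to column values, but the second stays in DataFrame terms and needs no manual transpose.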

@zosiaboro zosiaboro requested a review from pmayd January 15, 2024 16:22

@pmayd pmayd left a comment


LGTM

@pmayd pmayd force-pushed the feat/19-improve-readability-of-the-table-format branch from 4b9f51d to 92f7b16 Compare January 29, 2024 14:45
@pmayd pmayd merged commit 0e2f0a2 into dev Jan 30, 2024
9 checks passed
@pmayd pmayd deleted the feat/19-improve-readability-of-the-table-format branch January 30, 2024 06:10
pmayd added a commit that referenced this pull request Feb 20, 2024
* Bump version to next major version #9

* Revert flake8 to ^3.0 for docstrings #9

* add a notebook that shows how to run init_config

* Make dev dependencies optional, update lock and README #9

* Update workflow install --with dev, add matrix poetry version #9

* Fix python and poetry version definition #9

* Fix python and poetry version definition #9

* fix lock file

* update dev dependencies and add python-dotenv to dev

* improve readme

* update readme

* Feat/8 handle multiple databases and users (#20)

* change config module to handle multiple databases

* finalize work on config module to handle multiple databases; significantly reduced lines of code by getting rid of the settings.ini

* add a new db module that serves as a layer between the user and the config. Can set the current active database and get the settings from the config

* simplify config module

* refactor code to implement new config; correct tests

* fix all remaining tests

* fix all text issues

* update notebooks according to latest changes in config

* drop support for Python 3.9 due to pipe operator for types and set supported versions to 3.10 and 3.11

* fix problem with config dir creation during setup

* fix isort

* Improve clear_cache output for full wipe, remove unused import

* Address all non global-related pylint issues #20

* because of complexity get rid of the current support of custom config dir and always use the default config dir under user home.

* fix all tests; get rid of settings.ini and functionality for user to define own config path; pystatis supports only default config path but custom data cache path

* fix all tests; get rid of settings.ini and functionality for user to define own config path; pystatis supports only default config path but custom data cache path

* refactor config module to work with a ConfigParser global config object instead of overwriting the config variable within the functions using global (bad style according to pylint)

* address pylint issues

* fix mypy issues

* fix pylint issues

---------

Co-authored-by: MarcoHuebner <marco_huebner1@gmx.de>

* update README to the latest changes of multi database support

* Added lists of all available statistics and tables

* Feat/10 update and auto deploy sphinx (#27)

* Updated dev-dependencies, added first version of Sphinx documentation, including built html documentation.

* Added Logo, updated theme, updated GitHub workflow, fixed docstrings in cache and cube. Hosting on ReadTheDocs has to be done by Owner/ CorrelAid (but can be requested and triggered that way).

* Updated urllib3 version, but everything <2.0.0 (deprecating `strict`) should be fine...

* Updated poetry as recommended in cachecontrol issue report.

* Fixed black formatting, fixed make docs (is now ran by poetry).

* Fixed linting issue, updated packages, updated make docs.

* Updated ReadMe, added developer sphinx documentation, added custom pre-commit hook and changed to hard-coded version in docs, added built documentation to artifacts, #3

* Add deployment workflow, needs Repo updates

* Update dependencies for Sphinx documentation #10

* Remove redundant docu information #10

Render parts of the README.md in the respective .rst files

* Remove unused mdinclude, fix run-test py version, update pre-commit #10

* Fix dependency group for Sphinx workflow #10

* Fix docstring parameter rendering in Sphinx #10

* Fix image rendering by mimicking folder structure #10

* Add comment on warnings related to ext.napoleon #10

* Rename deploy-docs #10

* Fix black format issue in conf.py #10

* Update deploy key, add deploy trigger comment #10

* Update documentation deploy workflow #10

* Switch to matrix.os definition #10

* Fix pull_request target in deploy workflow #10

* Update poetry.lock #10

* Import package version to Sphinx docu #10

* Manually fix black formatting issue #10

* With auto-deploy working, decrease retention days #10

* Update readme and Sphinx header references #10

* Fix deploy to update files on the remote #10

* fix cube functionality: it seems like structure of QEI header part was changed as well as DQA no longer has information about axis so we assume that the order is preserved (#43)

* add jupytext and new nb for presentation

* Feat/35 implement regex matching (#44)

* Implemented regex matching, initial commit

* Added credentials check for cubes and removed all references to set_db()

* Implemented regex matching, initial commit

* Added credentials check for cubes and removed all references to set_db()

* fix tests

* refactoring Find and Result class to work with new database detection logic; because find does not use names like Table and Cube, the user has to specify the database

* fix tests
---------

Co-authored-by: Michael Aydinbas <michael.aydinbas@gmail.com>
Co-authored-by: Michael Aydinbas <michael.aydinbas@new-work.se>

* add presentation nb

* remove presentation nb for now

* Feat/19 improve readability of the table format (#42)

* Reformatting the raw data tables for readability

* Adding comments

* Applied suggested changes and run code formatting

* add tests for Table

---------

Co-authored-by: Michael Aydinbas <michael.aydinbas@new-work.se>

* prepare Table so it can parse data from three different sources

* Added description and examples of Find

* implement parse logic for prettify zensus tables

* fix pylint issues

* edits on Find section

* fixing overwritten changes

* update presentation nb

* add genesis parse code for regio, too, for the moment.

* Feat/34 visualization examples (#48)

* Add 02_Geo_visualization_example.ipynb

* changed '-' to 0 instead of nan --> reproduce Simons result

* new case study in visualization notebook, integration to presentation notebook

* catch NA-values in read_csv and added Auspraegung_Code to table.py to have the unique region identifiers

---------

Co-authored-by: jkrause <jkrause123@users.noreply.github.com>

* final presentation nb and shape data; omit file check in pre-commit

* fixed typo and beautified plots in presentation.ipynb /.py

* add a first workaround for the new Zensus zip content type

* fix all tests; separate Find and Results classes into own modules

* update dependencies

* update README

* set version to 0.2

* remove Cubes from package for now; we no longer support cubes until they are requested

* fix all tests; fix all relevant nb;

* fix pylint issues

* fix mypy issues

* add documentation key

* update changelog

---------

Co-authored-by: MarcoHuebner <marco_huebner1@gmx.de>
Co-authored-by: Pia <45008571+PiaSchroeder@users.noreply.github.com>
Co-authored-by: MarcoHuebner <57489799+MarcoHuebner@users.noreply.github.com>
Co-authored-by: zosiaboro <50183305+zosiaboro@users.noreply.github.com>
Co-authored-by: Zosia Borowska <zofia.anna.borowska@gmail.com>
Co-authored-by: jkrause123 <89632018+jkrause123@users.noreply.github.com>
Co-authored-by: jkrause <jkrause123@users.noreply.github.com>
Successfully merging this pull request may close these issues.

Improve readability of the table format