
Feat/19 improve readability of the table format #42

Merged
merged 6 commits into dev from feat/19-improve-readability-of-the-table-format
Jan 30, 2024

Conversation

zosiaboro
Collaborator

Added a nice_data attribute to Table to display a reformatted data table

Resolves #19

@zosiaboro
Collaborator Author

I did not add a parameter like raw=False to choose between the data tables as we discussed, but instead added a new attribute. This is in line with how we name raw_data and metadata, although it might be good if the readable table were the default.
I couldn't think of a neat way of adding such a parameter, but let me know if you think of another way of doing this :)


@pmayd pmayd left a comment


Thanks for the PR and the work on this issue!

Sorry that I had so many comments, but there is a lot of room for improvement: better naming as well as more performant code.

If you have any questions, feel free to ask me anytime

@@ -41,9 +41,40 @@ def get_data(self, area: str = "all", **kwargs):
self.raw_data = raw_data
data_str = StringIO(raw_data)
self.data = pd.read_csv(data_str, sep=";")
self.nice_data = format_table(self.data)

You should not introduce a new class instance variable inside a method; all instance variables should already be known and set in the __init__ method.
self.data is meant to hold the "pretty" data, because we also have self.raw_data, which contains the raw data from the endpoint. So I think you can use self.data directly and make it more readable. We can control this via a parameter of the get_data method, like prettify=True, which by default makes the data more readable.
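A minimal sketch of this suggestion (the prettify_table helper, the inline CSV string, and the class skeleton are illustrative assumptions, not code from the PR):

```python
from io import StringIO

import pandas as pd


def prettify_table(data: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the reformatting logic; the real version reshapes the table."""
    return data


class Table:
    def __init__(self, name: str):
        self.name = name
        # All instance variables are declared up front in __init__.
        self.raw_data = ""
        self.data = pd.DataFrame()

    def get_data(self, prettify: bool = True, **kwargs):
        raw_data = "Zeit;Wert\n2020;1"  # stand-in for the endpoint response
        self.raw_data = raw_data
        self.data = pd.read_csv(StringIO(raw_data), sep=";")
        if prettify:
            # self.data holds the readable table; the raw string stays in raw_data
            self.data = prettify_table(self.data)
```

This keeps the public surface to just raw_data and data, with the parameter deciding which shape data takes.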


If you introduce too many class instance variables, you make the class harder to use, because users no longer know which attribute to choose: is it raw_data, data, or nice_data, and what are the differences?


metadata = load_data(
endpoint="metadata", method="table", params=params, as_json=True
)
assert isinstance(metadata, dict) # nosec assert_used
self.metadata = metadata

def format_table(data: pd.DataFrame,

It is OK to put the function outside of the class, but you could also have made it a staticmethod inside the class, so it is clearer that this method only works on the table data. From a function defined outside, I would actually expect that it is also used somewhere else.
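As a sketch, the staticmethod variant could look like this (the prettify_table name and the pass-through body are hypothetical):

```python
import pandas as pd


class Table:
    @staticmethod
    def prettify_table(data: pd.DataFrame) -> pd.DataFrame:
        """Reformat the flat CSV table into a more readable one.

        Placed inside the class to signal it only applies to table data.
        """
        return data  # the actual reshaping logic would go here
```

A staticmethod needs no self, so it stays a plain function while living in the class namespace.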


I think we should find a better name for the function, because you are not formatting the table: it is already formatted, as a flat CSV file. What we actually do is reformat the table into a more readable format, so you could consider something like reformat_table_data or prettify_table, which is also very close to my suggested parameter prettify=True.


metadata = load_data(
endpoint="metadata", method="table", params=params, as_json=True
)
assert isinstance(metadata, dict) # nosec assert_used
self.metadata = metadata

def format_table(data: pd.DataFrame,
) -> pd.DataFrame:

You need to run the pre-commit pipeline over this, or at least the black formatter; your style is a little off, like this empty bracket on a new line. We use black to format the code properly and consistently.


def format_table(data: pd.DataFrame,
) -> pd.DataFrame:
"""Format the raw data into a more readable table

Careful: you are not formatting the raw data but the data, i.e. self.data instead of self.raw_data; otherwise you would get a string and not a pandas DataFrame.

"""Format the raw data into a more readable table

Args:
data (pd.DataFrame): A pandas dataframe created with get_data()

Instead of "created", it would be better to say "returned by", to make clear that this is the return value of the mentioned method.

time_name, = data["Zeit_Label"].unique()
time_values = data["Zeit"]

merkmal_labels = data.filter(like="Merkmal_Label").columns

I know that in Germany this is called "Merkmal", but it reads strangely in an English code base, so maybe we could use the more appropriate "attributes" here, or something like it. I also admit that it is strange to have the German filter text as the like argument...

indep_names = [data[name].unique()[0] for name in merkmal_labels] # list of column names from Merkmal_Label

auspraegung_labels = data.filter(like="Auspraegung_Label").columns
indep_values = [data[name] for name in auspraegung_labels] # list of data from Ausgepragung_Label

You are just getting a list of columns here, right? Then it would be easier to use data[auspraegung_labels] directly, or even just data.filter(), which gives you the columns directly.
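For illustration, both forms side by side on a made-up two-row frame (column names and values are invented, modeled on the diff):

```python
import pandas as pd

# Hypothetical excerpt of the flat table returned by the endpoint.
data = pd.DataFrame(
    {
        "Auspraegung_Label_1": ["Berlin", "Hamburg"],
        "Auspraegung_Label_2": ["maennlich", "weiblich"],
        "Wert": [10, 20],
    }
)

# PR version: collect the column names, then index them one by one.
auspraegung_labels = data.filter(like="Auspraegung_Label").columns
as_list = [data[name] for name in auspraegung_labels]

# Suggested version: one filter call returns the columns as a DataFrame.
as_frame = data.filter(like="Auspraegung_Label")
```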


Again, instead of indep_values, I think it would be nicer to have something like merkmal_column_values.

auspraegung_labels = data.filter(like="Auspraegung_Label").columns
indep_values = [data[name] for name in auspraegung_labels] # list of data from Ausgepragung_Label

dep_values = data.loc[:,auspraegung_labels[-1]:].iloc[:,1:] # get all columns after last Auspraegung column

This reads very strangely; maybe you have to explain a little more why it is necessary? In any case, a simple drop would probably work better and read much more easily.


Or we could probably use the filter method again, scanning for the double underscore?
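A sketch of that filter idea, using a made-up frame in which only the value columns carry the double-underscore separator:

```python
import pandas as pd

# Hypothetical excerpt: value columns contain "__", dimension columns do not.
data = pd.DataFrame(
    {
        "Zeit": [2020],
        "Auspraegung_Label": ["Berlin"],
        "BEV036__Bevoelkerung_in_Hauptwohnsitzhaushalten__1000": [3669],
    }
)

# Instead of positional slicing with loc/iloc, select by the "__" marker.
dep_values = data.filter(like="__")
```

This avoids depending on column order entirely.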


dep_values = data.loc[:,auspraegung_labels[-1]:].iloc[:,1:] # get all columns after last Auspraegung column
dep_names = [" ".join(name.split('_')[1:])
for name in dep_values.columns] # splits strings in column names for readability

Your comments are a little too generic. "splits strings in column names for readability" explains the obvious, as the reader can see that you use the split method, but it is not clear WHY you split this way. A better comment would give a concrete example or explain why the first part of the name is omitted.


And I think you are keeping the unit, right? I think we should not do this but store this information somewhere else. So I would actually choose `[1:-1]` to exclude both the code and the unit from the text.

As a comment, something like this would be a good idea, I guess:

# Given a name like BEV036__Bevoelkerung_in_Hauptwohnsitzhaushalten__1000
# we extract the readable label and omit both the code and the unit

Also, I just noticed that you split by _ but the separator is __, so you could simplify your code again with [name.split('__')[1] for name in dep_values.columns], and you don't need join at all.
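The difference between the two splits on the example name from the comment above:

```python
name = "BEV036__Bevoelkerung_in_Hauptwohnsitzhaushalten__1000"

# Splitting on a single underscore also breaks the label itself apart and
# leaves empty strings where the double separators were:
joined = " ".join(name.split("_")[1:])  # stray spaces from the empty parts

# Splitting on the double underscore isolates code, label, and unit cleanly:
code, label, unit = name.split("__")
```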


nice_dict = {time_name:time_values,
**dict(zip(indep_names, indep_values)),
**dict(zip(dep_names, dep_values.values.T))}

It is strange that you have to call .values.T here; we should try to fix this, as it seems very odd and can probably be solved more easily. It is easy to miss and hard to understand while reading. I would have to test the code to see why you did it, but dep_values is a DataFrame, so you could just merge the two DataFrames, for example, or use the DataFrame's to_dict method: first overwrite the column names with dep_names and then export the DataFrame into a dict using to_dict. You need no zip for this.
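A sketch of the rename-then-to_dict approach on a made-up dep_values frame (names and values are illustrative):

```python
import pandas as pd

# Hypothetical value columns as selected earlier.
dep_values = pd.DataFrame(
    {"BEV036__Bevoelkerung_in_Hauptwohnsitzhaushalten__1000": [3669, 1841]}
)
dep_names = [name.split("__")[1] for name in dep_values.columns]

# PR version: transpose the underlying array and zip it with the names.
via_zip = dict(zip(dep_names, dep_values.values.T))

# Suggestion: overwrite the column names, then export with to_dict.
renamed = dep_values.copy()
renamed.columns = dep_names
via_to_dict = renamed.to_dict(orient="list")
```

Both produce a mapping from readable names to column values, but the second stays in DataFrame terms and needs no manual transpose.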

@zosiaboro zosiaboro requested a review from pmayd January 15, 2024 16:22

@pmayd pmayd left a comment


LGTM

@pmayd pmayd force-pushed the feat/19-improve-readability-of-the-table-format branch from 4b9f51d to 92f7b16 Compare January 29, 2024 14:45
@pmayd pmayd merged commit 0e2f0a2 into dev Jan 30, 2024
9 checks passed
@pmayd pmayd deleted the feat/19-improve-readability-of-the-table-format branch January 30, 2024 06:10
pmayd added a commit that referenced this pull request Feb 20, 2024
* Bump version to next major version #9

* Revert flake8 to ^3.0 for docstrings #9

* add a notebook that shows how to run init_config

* Make dev dependencies optional, update lock and README #9

* Update workflow install --with dev, add matrix poetry version #9

* Fix python and poetry version definition #9

* Fix python and poetry version definition #9

* fix lock file

* update dev dependencies and add python-dotenv to dev

* improve readme

* update readme

* Feat/8 handle multiple databases and users (#20)

* change config module to handle multiple databases

* finalize work on config module to handle multiple databases; significantly reduced lines of code by getting rid of the settings.ini

* add a new db module that serves as a layer between the user and the config. Can set the current active database and get the settings from the config

* simplify config module

* refactor code to implement new config; correct tests

* fix all remaining tests

* fix all text issues

* update notebooks according to latest changes in config

* drop support for Python 3.9 due to pipe operator for types and set supported versions to 3.10 and 3.11

* fix problem with config dir creation during setup

* fix isort

* Improve clear_cache output for full wipe, remove unused import

* Address all non global-related pylint issues #20

* because of complexity get rid of the current support of custom config dir and always use the default config dir under user home.

* fix all tests; get rid of settings.ini and functionality for user to define own config path; pystatis supports only default config path but custom data cache path

* fix all tests; get rid of settings.ini and functionality for user to define own config path; pystatis supports only default config path but custom data cache path

* refactor config module to work with a ConfigParser global config object instead of overwriting the config variable within the functions using global (bad style according to pylint)

* address pylint issues

* fix mypy issues

* fix pylint issues

---------

Co-authored-by: MarcoHuebner <marco_huebner1@gmx.de>

* update README to the latest changes of multi database support

* Added lists of all available statistics and tables

* Feat/10 update and auto deploy sphinx (#27)

* Updated dev-dependencies, added first version of Sphinx documentation, including built html documentation.

* Added Logo, updated theme, updated GitHub workflow, fixed docstrings in cache and cube. Hosting on ReadTheDocs has to be done by Owner/ CorrelAid (but can be requested and triggered that way).

* Updated urllib3 version, but everything <2.0.0 (deprecating `strict`) should be fine...

* Updated poetry as recommended in cachecontrol issue report.

* Fixed black formatting, fixed make docs (is now ran by poetry).

* Fixed linting issue, updated packages, updated make docs.

* Updated ReadMe, added developer sphinx documentation, added custom pre-commit hook and changed to hard-coded version in docs, added built documentation to artifacts, #3

* Add deployment workflow, needs Repo updates

* Update dependencies for Sphinx documentation #10

* Remove redundant docu information #10

Render parts of the README.md in the respective .rst files

* Remove unused mdinclude, fix run-test py version, update pre-commit #10

* Fix dependency group for Sphinx workflow #10

* Fix docstring parameter rendering in Sphinx #10

* Fix image rendering by mimicking folder structure #10

* Add comment on warnings related to ext.napoleon #10

* Rename deploy-docs #10

* Fix black format issue in conf.py #10

* Update deploy key, add deploy trigger comment #10

* Update documentation deploy workflow #10

* Switch to matrix.os definition #10

* Fix pull_request target in deploy workflow #10

* Update poetry.lock #10

* Import package version to Sphinx docu #10

* Manually fix black formatting issue #10

* With auto-deploy working, decrease retention days #10

* Update readme and Sphinx header references #10

* Fix deploy to update files on the remote #10

* fix cube functionality: it seems like structure of QEI header part was changed as well as DQA no longer has information about axis so we assume that the order is preserved (#43)

* add jupytext and new nb for presentation

* Feat/35 implement regex matching (#44)

* Implemented regex matching, initial commit

* Added credentials check for cubes and removed all references to set_db()

* Implemented regex matching, initial commit

* Added credentials check for cubes and removed all references to set_db()

* fix tests

* refactoring Find and Result class to work with new database detection logic; because find does not use names like Table and Cube, the user has to specify the database

* fix tests
---------

Co-authored-by: Michael Aydinbas <michael.aydinbas@gmail.com>
Co-authored-by: Michael Aydinbas <michael.aydinbas@new-work.se>

* add presentation nb

* remove presentation nb for now

* Feat/19 improve readability of the table format (#42)

* Reformatting the raw data tables for readability

* Adding comments

* Applied suggested changes and run code formatting

* add tests for Table

---------

Co-authored-by: Michael Aydinbas <michael.aydinbas@new-work.se>

* prepare Table so it can parse data from three different sources

* Added description and examples of Find

* implement parse logic for prettify zensus tables

* fix pylint issues

* edits on Find section

* fixing overwritten changes

* update presentation nb

* add genesis parse code for regio, too, for the moment.

* Feat/34 visualization examples (#48)

* Add 02_Geo_visualization_example.ipynb

* changed '-' to 0 instead of nan --> reproduce Simons result

* new case study in visualization notebook, integration to presentation notebook

* catch NA-values in read_csv and added Auspraegung_Code to table.py to have the unique region identifiers

---------

Co-authored-by: jkrause <jkrause123@users.noreply.github.com>

* final presentation nb and shape data; omit file check in pre-commit

* fixed typo and beautified plots in presentation.ipynb /.py

* add a first workaround for the new Zensus zip content type

* fix all tests; separate Find and Results classes into own modules

* update dependencies

* update README

* set version to 0.2

* remove Cubes from package for now; we no longer support cubes until they are requested

* fix all tests; fix all relevant nb;

* fix pylint issues

* fix mypy issues

* add documentation key

* update changelog

---------

Co-authored-by: MarcoHuebner <marco_huebner1@gmx.de>
Co-authored-by: Pia <45008571+PiaSchroeder@users.noreply.github.com>
Co-authored-by: MarcoHuebner <57489799+MarcoHuebner@users.noreply.github.com>
Co-authored-by: zosiaboro <50183305+zosiaboro@users.noreply.github.com>
Co-authored-by: Zosia Borowska <zofia.anna.borowska@gmail.com>
Co-authored-by: jkrause123 <89632018+jkrause123@users.noreply.github.com>
Co-authored-by: jkrause <jkrause123@users.noreply.github.com>
Successfully merging this pull request may close these issues.

Improve readability of the table format