-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add legal hold membership to device reporting #192
Conversation
CLA Assistant Lite bot All contributors have signed the CLA ✍️ |
I have read the CLA Document and I hereby sign the CLA |
CHANGELOG.md
Outdated
@@ -34,6 +34,9 @@ how a consumer would use the library (e.g. adding unit tests, updating documenta | |||
- `search` to search for audit-logs. | |||
- `send-to` to send audit-logs to server. | |||
|
|||
- `code42 legal-hold show` option: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have a new devices list
command that is merged by not yet released.
Would it make more sense to have a command code42 devices list --matter-id <UID>
instead?
Just want to make sure all the options are considered, I don't really understand the use-case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an interesting point. I chose to add it to legal-hold show
because that command already showed details on a legal hold matter object, including users on legal hold and (optional) preservation policy details. I thought showing devices on legal hold would be a logical fit here.
From what I gather, the devices
command renders mostly the Computer
api resource. Unfortunately, there is no flag in the Computer
api response that states whether a device is on legal hold. In order to pull legal hold devices, you first need to pull legal hold membership to get a list of users and then find the devices associated with those users. It would not be impossible to move this into the devices
command, but we would still have to run some sort of legal hold py42 resource to determine the scope of devices.
Pinging @ceciliastevens for input as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or a code42 legal-hold list-devices [MATTER-ID]
. Trying to find a way to afford output format options (At least table and csv).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, if we can find a way to structure this to add a format option, that would be helpful to most customers.
Currently, the legal-hold
show command outputs data in the following formats:
- Matter info - table
- Users - columns
- (optional) Preservation policy - JSON
What do you think about removing users from the show
command and creating code42 legal-hold list-users [Matter ID]
with an --include-devices
optiion? I think it makes sense to keep users and devices closely tied together, rather than in separate commands. From a code perspective, to get devices, you'll have to get users first, anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that! @timabrmsn What are you thoughts on this? I believe you were the one who implemented the show command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think it makes more sense to add legal hold status as an option to the devices list
command, since that would be a useful option for people who are creating a report of all their devices to know which ones are on legal hold quickly when taking inventory of other things. It could just add a legalHoldMatterId
column to the output.
And keeping users under legal-hold show
still makes sense to me, since it's a high level overview of the matter, of which membership is a key detail. Otherwise the show
command just becomes a single row of the list
command with the option to also show preservation policy.
devices list
already has --include-usernames
which makes a separate call to api/Users
and then merges the DataFrames on userId
, so adding legal hold status would be easy enough to add using the same logic.
The storage total report is an interesting idea, but might be better suited to include in devices list
report also, so it can be used more generically and not tied directly to only legal hold. I'm thinking something like --include-storage-totals
on devices list
that just adds a column per device for the sum of all that device's archives. Then if someone wanted to get a report broken down by matter Id, they could just filter by matter_id and sum the total in the csv report.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, devices
it is. The device pull actually includes userUid
without the username option, so it's even easier to merge in the legal hold data.
There are two main things that we want this to accomplish:
-
Limit the device list to devices on a specific legal hold matter
devices --matter-uid
- option to limit device search base on a specific legal hold matter, adds in columns such as legalHoldStatus, legalHoldMatterId, legalHoldMatterName -
Add in legal hold membership columns for whatever device query you're running.
devices -include-legal-hold-membership
- option to option to include legal hold membership on to the devices table, adds same columns as above but doesn't limit the population
This feels a little redundant, so any feedback is appreciated.
As for storage-totals
: is a sum of all device's archives appropriate? (this is how I wrote my first draft). However, the current legal hold script does some logic to determine which destination is the most recent and complete, and only uses that archive size as the total storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that since filtering by matter-id outside of the CLI is simple enough (i.e. grep/Select-String or directly in excel), can we leave that to the end-user? I'd like to not have our option list expand too much. And I have some ideas for more generic column/row filtering logic that could work on any cmd that outputs a dataframe. Something like `--filter 'matter-id == 42' that would convert to a pandas expression under the hood. Although I haven't started trying to implement that so I'm not sure how feasible it is...
Also, because a device can be part of multiple legal holds, we need to think of how to display that. Can we just have a single column that has a list of matter-uids the device is part of, maybe separated by semi-colons or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, for storage totals I think adding a usedStorage
column that sums the total combined archive size for a device would work, maybe we could also add an archiveCount
column that gets added with that option?
Trying to keep what features we add as generic and multi-use-case friendly as possible.
src/code42cli/cmds/legal_hold.py
Outdated
) | ||
if len(devices_dataframe.index) > 0: | ||
echo("\nMatter Members and Devices:\n") | ||
click.echo(devices_dataframe.to_csv()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We typically use table format for show commands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List commands, however, are known to support various output formats, inc. CSV.... My first comment on this PR suggests that this would make sense as a code42 devices list --matter-id <UID>
command, which then you can tack on -f CSV
or -f JSON
etc.... Would that be better?
tests/cmds/test_legal_hold.py
Outdated
@@ -191,6 +459,11 @@ def preservation_policy_response(mocker): | |||
return _create_py42_response(mocker, POLICY_RESPONSE) | |||
|
|||
|
|||
@pytest.fixture | |||
def devices_list_generator(mocker): | |||
return [TEST_DEVICE_PAGE] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it is better to actually use generators for mocks when they are supposed to be rather than lists.. It saves future headaches, in my experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might need to use side_effect
instead of return_value
in that case, not sure
…tests, update CHANGELONG
I have committed new changes that remove the edits to
As for the |
Can we update the description please based on latest changes. I thought to try it out reading the description and then looking at the code it confused me more. |
src/code42cli/cmds/devices.py
Outdated
if df.empty: | ||
click.echo("No results found.") | ||
else: | ||
formatter = DataFrameOutputFormatter(format) | ||
formatter.echo_formatted_dataframe(df) | ||
|
||
|
||
def _add_legal_hold_membership_to_device_dataframe(sdk, df): | ||
matters = _get_all_active_matters(sdk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Kiran above that we probably shouldn't use the legal-hold module's internal functions, just for the sake of if someone later needs to change them it could break stuff here.
Since we only need to iterate through all membership dicts, we can simplify getting them all with a generator function instead of building a new list and extending it:
def _get_all_active_hold_memberships(sdk):
for page in sdk.legalhold.get_all_matters(active=True):
for matter in page["legalHolds"]:
for _page in sdk.legalhold.get_all_matter_custodians(
legal_hold_uid=matter["legalHoldUid"], active=True
):
for membership in _page["legalHoldMemberships"]:
yield membership
And pandas has a great function that can 'flatten' nested json into a dataframe for you, which can simplify a lot of this logic:
>>> df = pandas.json_normalize(_get_all_active_hold_memberships(sdk))
>>> df.head()
legalHoldMembershipUid active creationDate legalHold.legalHoldUid legalHold.name user.userUid user.username user.email user.userExtRef
0 955061003769628092 True 2020-05-20T15:58:38.823-05:00 955060991270602172 Test2 945056771151950748 test@example.com test@example.com None
1 988814086101244865 True 2021-01-08T11:25:23.645-06:00 946193444244471969 Test 973619347957360307 legalhold@code42.com legalhold@code42.com None
2 955320624671252917 True 2020-05-22T10:57:44.903-05:00 946193444244471969 Test 955163198918558214 de60@example.com de60@example.com None
3 955320624637698485 True 2020-05-22T10:57:44.892-05:00 946193444244471969 Test 955163197341499910 de59@example.com de59@example.com None
4 955320624604144053 True 2020-05-22T10:57:44.885-05:00 946193444244471969 Test 955163195865104902 de58@example.com de58@example.com None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
json_normalize
is great for unpacking the nested dictionary. However, I seem to be losing records (1 per matter). If I run the generator response table through pandas.DataFrame
, I get a table with more rows than I do running the same generator response through json_normalize
. I haven't quite worked this out yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I'll see if I get the same behavior and try to figure out what's happening too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to serialize the generator into a list in order for all records to be included, which may defeat the purpose of the generator. I pushed those changes moments ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there was a recent pandas bug about the generator thing (they weren't expecting generators to be passed): pandas-dev/pandas#35923
It's fixed in 1.2.2, but I think we can probably just call list() on the result for now instead of force updating to the latest pandas.
src/code42cli/cmds/devices.py
Outdated
df.loc[ | ||
df["status"] == "Deactivated", | ||
["legalHoldMemberActive", "legalHoldUid", "legalHoldName"], | ||
] = np.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using np.nan for null values here, let's use an empty string, I think NaN
would confuse people in the output and think maybe there was a bug. To replace any np.nan values resulting from the .groupby()
, you can call .fillna(value="")
on the df to replace those as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose NaN
to match the value assumed by pandas of an empty value. NaN
's are generated when the legal hold table is merged with the devices table. If I use the above solution prior to merging with the devices table, the legalHoldUid and legalHold.name columns will contain a mix of None's, actual values, and NaNs.
To confirm: which of the following would you prefer?
- NaN's in table view
- None's in table view
(Either will produce null values in the other output formats)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I meant to say "resulting from the.merge()
" instead of .groupby()
. Yeah, the .fillna(value="")
needs to be called after the merge.
I think we should convert NaN/None values to an empty string for table/CSV outputs, but we'd probably want to keep them as NaN/None when we output to json so it translates to null
. But we should push that logic to the DataFrameOutputFormatter
so we only need to handle it in one place for any commands that use DataFrames.
I'll start a new PR to rework the formatter to handle that, since I'd like to refactor it a bit for clarity and add some tests around handling dataframe nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NaNs removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've opened #245 to add the null handling in the DataFrameOutputFormatter, so we don't need to be concerned with converting them here, since we want the conversion to be conditional on the output format chosen by the end-user.
Sorry to have you flip/flop this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved, and awaiting #245
All PR comments should be resolved. (Waiting on NaN resolution in #245) The PR does not include a per-org storage calculation. The |
Hmm, it seems that the |
I'm not able to recreate that behavior - the option spits out backup usage data on my branch. Diffing with the upstream code42cli master doesn't show any changes to backup usage logic. Maybe something changed in py42? I'm using py42 v. 1.11.0 on my end |
Nope, it was just a device with a really large backup usage result towards the bottom of the list that was pushing that column way to the right. I think a separate option + column for total storage a device is using would be really helpful. I also am thinking But that can be a separate PR... I know a big part of the goal of this PR is to enable legal hold storage reporting, right? If we add an |
@maddie-vargo PR #245 just merged |
…of archives and archive bytes
|
I think this is good, just the small thing of filtering out V2 archives from the storage totals. I'll write up a ticket for testing with all the acceptance criteria and make sure I haven't missed anything else as I go through documenting each change. |
|
src/code42cli/cmds/devices.py
Outdated
@@ -466,8 +471,9 @@ def _break_backup_usage_into_total_storage(backup_usage): | |||
total_storage = 0 | |||
archive_count = 0 | |||
for archive in backup_usage: | |||
archive_count += 1 | |||
total_storage += archive["archiveBytes"] | |||
if archive["archiveFormat"] == "ARCHIVE_V1": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be if archive["archiveFormat"] != "ARCHIVE_V2":
, since while V3 archives are not prevalent, they still should show up in the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from change-log
comment LGTM.
CHANGELOG.md
Outdated
### Added | ||
|
||
- `code42 devices list` option: | ||
- `--include-legal-hold-membership` to add legal hold columns to the result output, printing the legal hold matter name and ID for any active device actively on legal hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we rephrase printing ....
to something like prints the legalhold matter name and ID for any active device on legal hold
!?
atleast active device actively
seems off while reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that is terrible phrasing. Resolved with latest commit
CHANGELOG.md
Outdated
@@ -58,6 +58,7 @@ how a consumer would use the library (e.g. adding unit tests, updating documenta | |||
- `search` to search for audit-logs. | |||
- `send-to` to send audit-logs to server. | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like our changelog has gotten messed up? The #Unreleased
tag should not be there...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@unparalleled-js I re-pulled from master. The #Unreleased tag should be at the top, right? Let me know if I should fix it with this PR
Description of Change
This PR addresses Phase I of #176 to add an
--include-devices
option to the legal-hold command to show devices associated with the legal hold custodians captured by the query. This option will also total storage held by these devices and print it in a table grouped by organization.This PR requires pandas as a dependency. Python 3.5 does not support the pandas library, so this should not be merged until support ends for 3.5.
Testing Procedure
PR Checklist
Did you remember to do the below?