User session tracking #1114

Closed

Conversation

chouseknecht
Contributor

@chouseknecht chouseknecht commented Aug 29, 2018

This attempts to bring together #1111 and #1013 in a single solution that is flexible enough to capture any UI events we want to track, though it initially focuses on search.

Here's what it does so far:

  • Assigns a unique ID to each site visitor, whether anonymous or authenticated
  • Provides /api/v1/events/ endpoint for creating UI events
  • Endpoint accepts event_type and event_data as params, where event_data is a JSONField
  • Only accepts a defined list of event_type values, which for now only contains 'search'
  • Cleans and formats incoming JSONField data for 'search' events

For each user, each unique search is tracked, including query params, number of results returned, and subsequent clicks on next page, prev page, page size, and content item.

The resulting data structure for a search looks like the following:

{
    "id": 5,
    "url": "/api/v1/events/5/",
    "related": {},
    "summary_fields": {},
    "modified": "2018-08-30T21:55:01.141920Z",
    "session_id": "52722f71-d52c-456c-9cd4-ae2a29c11536",
    "event_type": "search",
    "event_data": {
        "search_results": 0,
        "search_params": {
            "keywords": "nginx",
            "deprecated": "false"
        },
        "next_page_clicks": 3,
        "prev_page_clicks": 2,
        "repositories_clicked": [
            26290,
            26763,
            44081,
            31119,
            59668,
            58767,
            58741,
            34933
        ],
        "results_clicked": [
            12,
            81,
            85,
            92,
            1058,
            1045,
            1043,
            1
        ],
        "content_items_clicked": [
            7334,
            7509,
            16232,
            9476,
            26502,
            25918,
            25890,
            1907
        ]
    },
    "active": null
}
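
For illustration, a client might assemble the request body for the events endpoint roughly as follows. This is a minimal sketch, not the PR's code: `build_event_payload` and `ALLOWED_EVENT_TYPES` are hypothetical names, and the check only mirrors the stated rule that 'search' is the sole accepted event_type for now.

```python
import json

# Assumption: 'search' is the only accepted event type, per the description.
ALLOWED_EVENT_TYPES = {"search"}


def build_event_payload(event_type, event_data):
    """Return a JSON body for POST /api/v1/events/ (hypothetical helper)."""
    if event_type not in ALLOWED_EVENT_TYPES:
        raise ValueError("%s is not a valid choice." % event_type)
    return json.dumps({"event_type": event_type, "event_data": event_data})


# Example body for a new search event, using values from the sample above.
body = build_event_payload(
    "search",
    {
        "search_params": {"keywords": "nginx", "deprecated": "false"},
        "search_results": 0,
    },
)
```

The server would still need to clean and validate event_data itself; a client-side check like this only avoids sending requests that are certain to be rejected.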

@chouseknecht chouseknecht force-pushed the feature/user-sessions branch 3 times, most recently from dfc6462 to 5880ecc Compare August 30, 2018 22:47
@chouseknecht chouseknecht changed the title [WIP] User session tracking User session tracking Aug 30, 2018
@chouseknecht chouseknecht force-pushed the feature/user-sessions branch from 5880ecc to 0f92181 Compare August 30, 2018 23:04
@chouseknecht chouseknecht force-pushed the feature/user-sessions branch from 0f92181 to 98924a6 Compare August 31, 2018 01:38
'summary_fields',
'created',
'modified',
'session_id',
Member

I would prefer it if we didn't make users' actual session IDs available to the public via the API. These could potentially be used to link searches back to individual users, or make it easier for attackers to spoof other people's sessions.

I propose storing SHA hashes of the session IDs in the database instead. That way we can still aggregate events by a unique session identifier, but we can't look up session IDs that are linked to a specific user's session without knowing the real session ID, which should only be available to the individual user.
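
A minimal sketch of the hashing idea, assuming SHA-256 (the comment only says "SHA") and a hypothetical helper name `hash_session_id`:

```python
import hashlib


def hash_session_id(session_id):
    """Return the SHA-256 hex digest of a session ID string.

    Hypothetical helper; the choice of SHA-256 is an assumption.
    """
    return hashlib.sha256(session_id.encode("utf-8")).hexdigest()


# The raw ID stays with the visitor; only the digest would be stored.
raw_id = "52722f71-d52c-456c-9cd4-ae2a29c11536"
stored = hash_session_id(raw_id)
```

Because the digest is deterministic, events from the same session still group together, while the stored value cannot be reversed to obtain the session ID.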

Collaborator

Completely agree. Given the current implementation of the ACL (the SessionEventAccess class), it's a huge security issue.

def can_add(self, data):
    return True

def can_change(self, obj, data):
Member

Maybe make it so that you can only change an object if the session IDs match.
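
Something along these lines, as a rough sketch only. `SessionEventAccessSketch` is a hypothetical stand-in, not the PR's SessionEventAccess class, and it assumes the requesting session's identifier is available to the access object:

```python
import hmac


class SessionEventAccessSketch(object):
    """Hypothetical stand-in for SessionEventAccess, for illustration only."""

    def __init__(self, session_id):
        # Assumption: the identifier of the session making the request.
        self.session_id = session_id

    def can_change(self, obj, data):
        # Allow the change only when the event belongs to the requesting
        # session. compare_digest performs a constant-time comparison.
        return hmac.compare_digest(
            str(obj.session_id).encode("utf-8"),
            str(self.session_id).encode("utf-8"),
        )
```

If hashed session IDs are stored instead, the same comparison would apply to the hashes.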

Collaborator

Agree

model = models.SessionEvent

def can_read(self, obj):
    return True
Collaborator

Do we really want to allow anybody access to our internal session tracking information?
I would say it should be push-only.

Member

If we change SessionEvent to store session hashes, does it matter if people can read the hashes?

class ValidSearchType(object):
    choices = []

    def __init__(self):
Collaborator

Why do you initialize a static class field in the __init__ method?

class ValidSearchType(object):
    choices = [tp.value for tp in constants.EventType]

def __call__(self, value):
    if value not in self.choices:
        message = '%s is not a valid choice.' % value
        raise drf_serializers.ValidationError(message)
Collaborator

Actually, you could replace this whole class with a simple function:

def valid_search_type(value):
    try:
        constants.EventType(value)
    except ValueError:
        raise drf_serializers.ValidationError(...)
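
A self-contained sketch of the same idea, for trying it outside the project. The `EventType` enum and `ValidationError` class below are stand-ins defined only so the snippet runs on its own; the real code would use the project's constants module and drf_serializers.ValidationError.

```python
from enum import Enum


class EventType(Enum):
    # Assumption: 'search' is the only event type, per the PR description.
    SEARCH = "search"


class ValidationError(Exception):
    """Stand-in for drf_serializers.ValidationError."""


def valid_event_type(value):
    """Raise ValidationError unless value is a known event type."""
    try:
        # Looking up an Enum member by value raises ValueError if none matches.
        EventType(value)
    except ValueError:
        raise ValidationError("%s is not a valid choice." % value)
```

This relies on the standard Enum behaviour of raising ValueError for an unknown value, so no separate list of choices has to be kept in sync.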

return {}

def update(self, instance, validated_data):
    event_data = clean_search_data(copy.copy(instance.event_data))
Collaborator

Why is the copy needed here?
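
For context, a small sketch of what copy.copy does and does not protect against on a dict shaped like event_data. The sample values are illustrative, and whether clean_search_data mutates its argument is not shown in this thread:

```python
import copy

event_data = {"next_page_clicks": 3, "results_clicked": [12, 81]}

# A shallow copy duplicates only the top-level dict.
shallow = copy.copy(event_data)
shallow["next_page_clicks"] += 1       # original counter is untouched
shallow["results_clicked"].append(85)  # nested list is shared with original

# A deep copy duplicates nested containers as well.
deep = copy.deepcopy(event_data)
deep["results_clicked"].append(92)     # original list is untouched
```

So if the intent is to avoid modifying instance.event_data, a shallow copy only helps for top-level keys; changes to the nested click lists would still reach the original.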

}


def append_list_to_list(source_data, target_data, label):
Collaborator

It's unclear what this function does. A docstring or some explanation is required.
The function name suggests a commonly used utility function, but it actually implements context-specific business logic.
