Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve history #2

Open
karec opened this issue May 11, 2015 · 10 comments
Open

Improve history #2

karec opened this issue May 11, 2015 · 10 comments

Comments

@karec
Copy link
Collaborator

karec commented May 11, 2015

For the moment, the history feature is very basic and not very powerful, I think we can improve it by adding some features :

  • Keep full history even if previous is used
  • Add a next feature
  • keep trace for every url opened with all methods (in case of submit form, following links, etc...)
@maowen
Copy link
Contributor

maowen commented May 19, 2015

I'd be happy to work on this feature. It would help me to clarify a few things first.

  1. Can you clarify what you mean by "Keep full history even if previous is used"? I understand that to mean that a chronological list or all GET responses should be stored in the Browser._history list.
  2. I propose that the Browser._history object be modified to contain the requests.Response object instead of the URL. This could then function as a cache of the GET responses and would make the behavior of the Browser.back method behave more like a real browser.
  3. I propose that the Browser.back method not reload the page/request. The new behavior would be to iterate backward through the Browser._history list of GET request.Response objects and set the appropriate Browser variables to make the previous page the active page (This would not modify the history unless the active page was re-requested).
  4. I propose that a Browser.reload or Browser.refresh method be implemented that would re-request the active page (this would append the response of the re-request to the history list).
  5. Can you clarify what you mean by "Add a next feature"? Would this do the reverse of the Browser.back method (move forward one response in the history and make that response the active page)? I would implement this as Browser.foward.
  6. Can you clarify what the trace should contain? I understand this to be a list or all Request.response objects for all HTTP requests (GET, POST, PUT, etc.) similar to the Browser._history list but for all HTTP request methods.

Let me know if you have any feedback and then I'll start making the changes.

@karec
Copy link
Collaborator Author

karec commented May 19, 2015

Hi maowen ! Happy to see you again !

I've already make a work around for the historic, not released yet only on github, you can check it here: https://github.com/karec/oct-browser/blob/master/octbrowser/history.py

But it's only a very basic work around

I will try to answer your questions :

  1. Actually the previous method is very basic, and if it's used you can not move forward, that's not realy usefull
  2. It is a great idea , but I fear that for some projects, the memory usage is too high, for example in the OCT project , the goal is to open a very large amount of content. But I think we can make it simpler and more modular : if we use an object as historical , we can create an abstract class, and implement the historic class several times , for example a simple who keep a list URLs, let's says BasicHistoric, and advanced that keeps complete response from requests, let's says CachedHistoric
  3. Same as 2
  4. Great idea, and very simple to implement
  5. This is it ! Sorry for my english, not perfect yet ^^'
  6. Same as 2

I hope that was helpful, but I think that we can make a mix of those ideas, and make the browser more extensible. I realy think we should implement the history property as a class, and let's the user create a whole new historic class if needed. Plus we can implement multiples historic in the core, we just need to define an abstract class and inherit from it. I will take a look at it and update my work for the base

@maowen
Copy link
Contributor

maowen commented May 20, 2015

Karec,

Thanks for answering my questions! I see now that you already started history.py. Sorry I didn't notice before. I still have a couple more questions. I'm new to collaboration on github. Let me know if you'd prefer I address my questions some other way (e.g. just make the changes I think are best and submit a pull request).

  • If we change the History class to use a deque with a maxlen this would provide the user an easy way to limit the amount of memory used by the history cache. Do you think it's best to make the default mode of the core browser history to only store the url string? Or, would this be an adequate enough solution to allow the history to always cache the request.Reponse objects? I like the idea of caching the last n request.Reponse objects in history because it would make octbrowser behave very similar to a real browser (no refresh when going forward/back in history). If you believe it is best to only store the url string in history by default that is how I'll implement it. Just let me know what you decide. Here is the change that would to allow for storing the request.Response with memory usage constraints:
from collections import deque

MAXLEN = 10 # or MAXLEN = None for unlimited deque length

class History(object):
    def __init__(self, maxlen=MAXLEN):
        self.current = 0
        self.history = deque(maxlen=maxlen)

    ...
  • It would help me to define the interface between the elements of History.history and the core browser.Browser. Currently, the browser expects the history items to be url strings. If we also wish to support a CachedHistory abstract class then in that case the history items would likely be request.Response objects. So my question is, would it be better to have browser.back/browser.forward handle the cases for different types of history elements or should we define required properties of a history element? I prototyped the code for the two different case below:
  • Implementation that explicitly handles each supported history item type:
    def back(self):
        """Implementation that explicitly handles each supported history item type"""
        try:
            item = self._history[-1]
            # String url history item
            if issubclass(item, str):
                resp = self.open_url(item, back=True)
            # requests.Response history item
            elif issubclass(item, requests.Response):
                self._url = item.url
                resp = self._parse_html(item)
                return resp
            else:
                raise Exception('Invalid history item')
            return resp
        except IndexError:
            raise Exception("No history, cannot go back")
  • Implementation that requires history item to have a item.url attribute:
    def back(self):
        """
        Implementation that requires history item to have a url attribute

        If the history item contains the cache content as well then it won't require 
        a reload of the page.
        """
        try:
            item = self._history[-1]
            try:
                url = item.url
            except AttributeError:
                raise Exception('History item must have url attribute')
            self._url = item
            try:
                resp = self._parse_html(item)
            except AttributeError:
                resp = self.open_url(item, back=True)
                resp = self._parse_html(resp)
            return resp
        except IndexError:
            raise Exception("No history, cannot go back")
  • If you like the idea of requiring the History items to have a item.url attribute we could probably implement it as a HistoryItem class that pretty much only has the url property and then make an abstract class CachedHistoryItem that inherits from both HistoryItem and request.Response.

Thanks for letting me contribute and being so welcoming! Also, your english is very good!

Mark

@karec
Copy link
Collaborator Author

karec commented May 20, 2015

maowen,

Thanks again for your work !

I think github issue is the right tool for discussing about a feature, so we can keep a trace of all our work around in the same place, but if you want to show more complex code example, you can use gists ! (https://gist.github.com/) It's pretty simple to use and more comfortable than markdown ;)
Pull requests are great too ! But since we are in the reflection part for this feature, I think the issue is more appropriate :)

I will try to answer your questions :

  • I really think that the idea of a CachedHistory is a very good solution, plus, your solution for limited the number of items in history is really good ! But I think we should also let the user decide, and even more, create his own history system ! That could be great for example if you use a library for requests that already cache the response. In that case why store again the response objects ? But I agree with you, my work around for this history isn't permissive enough for extending or create new historic and I will update it ASAP.
  • I think we can do even simpler ! The solution of returning a request.Response object seems perfect, event for a historic that only use url ! Let me explain it :

The octbrowser already depends on the requests library, so we can say that an implementation of the HistoryBaseClass (for example) must return a request.Response object in all case, and if it only stores the url, the historic will request that url itself and return the response.
With this logic the browser back and forward methods will be pretty simple, for example we can have this for the back method :

def back(self):
    """Basic implementation for back method of the browser
    """
    response = self._history.back()
    parsed_response = self._parse_html(response)
    self._url = parsed_response.url

So that's it, three lines, and if the historic raise an exception, the user can catch it in his own code ! And if the historic does not return a correct request.Response object, an exception will be raised

What do you think of this solution ?

  • Fortunately the request.Response object already has an url property

Thanks for your contribution ! I really appreciate !

Karec

@maowen
Copy link
Contributor

maowen commented May 21, 2015

Karec,

Yes! I do like the concept of requiring the history to return a request.Response object. I also think we should take it one step further and define both the inputs and outputs of the History class to be request.Reponse objects/sublcasses. For example:

    def open_url(self, url, data=None, back=False, **kwargs):
        """Open the given url"""
        if data:
            # No history is stored for POST requests
            response = self.session.post(url, data, **kwargs)
            self._url = url
        else:
            response = self.session.get(url, **kwargs)
            if not back:
                assert issubclass(response, request.Response) # Unnecessary
                self._history.append_item(response)
            self._url = response.url
        response = self._parse_html(response)
        response.connection.close()
        return response

    def back(self):
        """Basic implementation for back method of the browser"""
        response = self._history.back()
        assert issubclass(response, request.Response) # Or let the user catch any exceptions
        parsed_response = self._parse_html(response)
        self._url = parsed_response.url

Using request.Response objects for both the inputs and outputs I believe is a very good way to architect the browser interface with the History class.

The other thing I think we should define is the default behavior of the octbrowser history. It will be great to use a HistoryBase class so users can override the default History implementation to fit there needs. I personally would like to have the default behavior be as similar to a real browser as possible.

These are the two potential solutions that I can think of that provide browser-like default behavior. The third option is simple but is not guaranteed to provide a good representation of the historic request response:

  1. Cache the full request.Response objects in the history and return them as
    appropriate when the browser.back/forward methods are called.
  2. Cache the request.Request or request.PreparedRequest objects (these are contained inside of the request.Response). When the user calls the back/forward method, the history class resubmits the cached request.Request in a GET request and returns the resulting request.Response.
  3. Cache the request.Response.url in the history and return the reloaded content using requests.get(url) when the history.back/forward methods are called.

The problem I see with only saving the URL in the default history is that there
will no way to correctly reproduce the GET requests from the history. This is because the GET requests include the session.headers, session.cookies, and any request specific headers (e.g. browser.open_url('http://github.com', headers={'Referer': 'http://www.google.com'})). In my opinion all of these should be cached so the history is an accurate representation of historic GET requests.

Which solution (1, 2, or 3) do you think is best for the default history behavior? I suppose the beauty of the design is it is very easy for the user to change between any of those history implementations list above or any others that they desired. Implementing the history as a HistoryBase class really is a great idea!

One last question, how do you want me to help implement these changes? I'm happy to make the changes in both browser.py and/or history.py as well as adding the appropriate tests. Just let me know where I can be helpful.

Thanks!

maowen

karec added a commit that referenced this issue May 21, 2015
… a BaseHistory class + allow browser._history to be set to None
@karec
Copy link
Collaborator Author

karec commented May 21, 2015

@maowen,

I've committed a little work around for the history, you can check the commit for all details but here is the summary :

  • New structure for octbrowser folder, history now has its own folder
  • I've created the BaseHistory class. As you suggested the BaseHistory will use requests.Reponse objects for both inputs and outputs
  • Also the history property can now be set to None in the browser and I've added the associated exceptions. So if _history is set to None but a method that require it is called, then the HistoryIsNone exception will be raised. But with this, users that don't need history can just set it to None and the open_url method + the submit_form methods won't call the _history.append_item
  • I've removed the back parameter in the open_url method, since the history object will take care of open urls / return response objects, I think it's a nonsense to keep that parameter

I think that's it ! It would be really great if you can take care of the CachedHistory, and I agree with you : it must be the default behavior for the history :)

Also for its implementation I think your 1st proposition is the best : cache the full requests.Response objects and return them.

I think the UrlOnly historic can wait for the moment since it's not a typical browser historic but can be helpful for generating report of all pages visited for example. And with this pattern it will be pretty easy to add new History classes !

I will continue to work on the browser side / BaseHistory and try to make regular commits for it

Thank again for your contribution !

Karec

@maowen
Copy link
Contributor

maowen commented May 30, 2015

Karec,

Do you think that making the CachedHistory class the default mode for the browser is a good idea? Or, do you prefer that history 'None' is the default? If we did make CachedHistory the default mode, one approach I can think of would be to remove history=None from the required Browser initializer args. For example,

from octbrowser.history.cached import CachedHistory

class Browser(object):
    def __init__(self, session=None, base_url='', **kwargs):
        try:
            self._history = kwargs['history']
        except KeyError:
            self._history = CachedHistory()

What do you think?

maowen

@karec
Copy link
Collaborator Author

karec commented May 31, 2015

@maowen

I think that making the CachedHistory as the default mode is the best solution for the browser !

I like your example, but instead of use a try...except block, I think we can use the .get() method on kwargs, like this :

def __init__(self, session=None, base_url='', **kwargs):
    self._history = kwargs.get('history', CachedHistory())

What do you think of this solution ?

Karec

@maowen
Copy link
Contributor

maowen commented May 31, 2015

Karec,

I like it! That's a much cleaner way of doing it.

I noticed that the documentation needs to be updated some to reflect the changes to we've made. If it's okay with you I'll work on improving the docs some this week.

maowen

@karec
Copy link
Collaborator Author

karec commented Jun 1, 2015

@maowen,

It would be a pleasure ! But if you don't feel to write it I can do it too :)

I've already made some updates to some docstrings in the code, mostly adding parameter type.
It would be really great to update the documentation before making a new release ! But once again, I can do it too :)

I will work on updating the README and the changelog too, and maybe (or maybe for the next release) improving the form management in the browser.
But I think, and many thanks to you, than for the next release we can leave the "beta" status since we now have full tests and very good test coverage !

Thanks again for your contribution !

Karec

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants