This repository has been archived by the owner on Mar 9, 2021. It is now read-only.

[Feature Request] Parse the regular website instead of using the tumblr api #33

Open
johanneszab opened this issue Mar 6, 2017 · 10 comments

Comments

@johanneszab
Owner

I've uploaded a branch here where my first steps are visible. If anyone wants to participate: I think the hard part is already figured out, namely how to parallelize it.

By using the archive page it's possible to determine the lifetime of the blog. We can access the archive via _archive?before_time=, where before_time is a unix timestamp. So we could start crawls at several points during the blog's lifetime, say xxx crawls per month, and stop each crawler once it hits a post already covered by a crawler that started at an earlier time.
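
A minimal sketch of that idea in C#, assuming the archive endpoint accepts a unix timestamp via before_time and that the first and last post dates are already known; the class and method names here are made up for illustration:

```csharp
using System;
using System.Collections.Generic;

static class ArchiveCrawlPlanner
{
    private static readonly DateTime UnixEpoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Spread crawl start points evenly across the blog's lifetime. Each
    // crawler walks backwards in time from its start url and stops once it
    // reaches a post already covered by the crawler started before it.
    public static IEnumerable<string> GetArchiveStartUrls(
        string blogUrl, DateTime firstPost, DateTime lastPost, int crawlerCount)
    {
        TimeSpan lifetime = lastPost - firstPost;
        for (int i = 0; i < crawlerCount; i++)
        {
            DateTime start = lastPost - TimeSpan.FromTicks(lifetime.Ticks * i / crawlerCount);
            long beforeTime = (long)(start - UnixEpoch).TotalSeconds;
            yield return blogUrl + "/archive?before_time=" + beforeTime;
        }
    }
}
```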

I've uploaded a functional commit for image crawling without accessing the tumblr api. It still needs optimization and a rebase onto the current master branch.

@johanneszab
Owner Author

johanneszab commented Mar 9, 2017

For people only interested in photos and videos, here we go: scraping the website. It might be faster, since the number of scan connections can be cranked up in the settings now that it no longer depends on the Tumblr api.
I've also played around with saving the cookie from the login process using the .NET browser control (some IE/Edge version) for private blog downloads, but unfortunately there are some redirects and the resulting site is completely different. So some detection is still missing, and it requires a complete set of code of its own.
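
For anyone who wants to experiment with that: a minimal sketch of pulling the login cookies out of the WinINet store that the .NET WebBrowser control shares. InternetGetCookieEx is the real WinINet function; whether its output is enough for private blog access is exactly the open question above:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class WinInetCookies
{
    // Flag to include HttpOnly cookies (requires IE 8+).
    private const int InternetCookieHttponly = 0x2000;

    [DllImport("wininet.dll", SetLastError = true)]
    private static extern bool InternetGetCookieEx(
        string url, string cookieName, StringBuilder cookieData,
        ref int size, int flags, IntPtr reserved);

    // Returns the cookie header stored for the given url, or null if none.
    // Assumes the WebBrowser control has already completed the login, so the
    // session cookies sit in the shared WinINet cookie store.
    public static string GetCookieHeader(string url)
    {
        int size = 0;
        InternetGetCookieEx(url, null, null, ref size, InternetCookieHttponly, IntPtr.Zero);
        if (size <= 0)
            return null;

        var buffer = new StringBuilder(size);
        if (!InternetGetCookieEx(url, null, buffer, ref size, InternetCookieHttponly, IntPtr.Zero))
            return null;
        return buffer.ToString();
    }
}
```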

  • No tagging support yet.
  • ONLY photos (no photosets or inline photos) and videos yet.

@sbobbo

sbobbo commented Mar 23, 2017

That's definitely preferable for me. I only want pictures, video, and gifs. I assume photos includes gifs?

FWIW, 1.0.4.31 never successfully downloaded anything for me coming from my 1.0.4.18 index files, although that doesn't matter much if the non-api version works. I'll test in a bit.

@johanneszab
Owner Author

johanneszab commented Mar 23, 2017

I've rebased it again onto the current master release (the one that uses the v1 api) @bb20189. Nothing is tested at all.

I too think that's the way to go, because we could potentially access private blogs and there are no connection limits. It's probably even worth considering just using RegEx to filter out all jpg/png/gif urls instead of parsing html nodes. It's much less hassle to write the code, and even if tumblr changes their layout, nothing should break as long as we scan the whole html for specific tags.
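
A minimal sketch of that RegEx idea, assuming the media urls still live on the *.media.tumblr.com hosts; the pattern is illustrative, not the one from the branch:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class HtmlMediaScanner
{
    // Illustrative pattern for tumblr-hosted media urls (jpg/png/gif).
    private static readonly Regex MediaUrl = new Regex(
        @"https?://\d*\.?media\.tumblr\.com/[^""'\s]+\.(?:jpg|png|gif)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Scan the whole raw html for media urls, no DOM parsing involved.
    public static IEnumerable<string> ExtractMediaUrls(string html)
    {
        return MediaUrl.Matches(html).Cast<Match>().Select(m => m.Value).Distinct();
    }
}
```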

I think it doesn't detect photosets or inline images yet.

@sbobbo

sbobbo commented Mar 23, 2017

1.0.5.2 and 1.0.4.35 cannot load my index files: "Error 1: Could not load file in library:" when I launch the program.

I also just realized that 1.0.4.18 must have messed up some of my index files last month when I ran it... I have a blank "Downloaded Files" (but "Number of Downloads" is still populated) for some long-running blogs, and the index files are very small. When I try to download again with 1.0.4.18, sure enough, it downloads every post ever instead of just the ones from the last month. I made sure never to expose those index files to any of the newer versions, I keep a backup, and a good number of the other blogs seem fine.

Argh, not sure how to deal with this....

@sbobbo

sbobbo commented Mar 23, 2017

1.0.5.1 seems to work with those broken index files though... I assume because it only grabs the posts since the last completion date? While that idea is great in theory, I worry about bugs, like if the program crashes mid-crawl or something.

Is that version safe to use for now? If I use that, will my index files become incompatible moving forward?

@johanneszab
Owner Author

So, we could also use a WebDriver, which is essentially a Chrome/Firefox without the UI. That would also simplify private blog access. It probably uses a lot of memory and fattens the application considerably, but it is surely the simplest implementation for now.
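
A minimal sketch of what that could look like with Selenium's C# bindings; the library choice, login url, and element ids are assumptions for illustration:

```csharp
using System.Net;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

static class PrivateBlogSession
{
    // Log in once with a real browser, then hand the session cookies over
    // to the normal downloader instead of driving every request through it.
    public static CookieContainer LoginAndGetCookies(string email, string password)
    {
        using (IWebDriver driver = new FirefoxDriver())
        {
            driver.Navigate().GoToUrl("https://www.tumblr.com/login");
            driver.FindElement(By.Id("signup_email")).SendKeys(email);       // element ids are assumptions
            driver.FindElement(By.Id("signup_password")).SendKeys(password);
            driver.FindElement(By.CssSelector("button[type='submit']")).Click();

            // Copy the browser session cookies into a container usable by HttpWebRequest.
            var container = new CookieContainer();
            foreach (OpenQA.Selenium.Cookie c in driver.Manage().Cookies.AllCookies)
                container.Add(new System.Net.Cookie(c.Name, c.Value, c.Path, c.Domain));
            return container;
        }
    }
}
```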

@johanneszab
Owner Author

johanneszab commented Jun 22, 2017

@willemijns

Hello, I don't see any settings or options anywhere for this non-API mode.

@johanneszab
Owner Author

The rate limiter is not included in this release, so it's not limited; it just connects to the normal website according to the remaining connection settings.

You can adjust the number of scan connections in the scan connection settings, and choose whether the scanning connections should count against the bandwidth throttler.
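
Conceptually, the scan connection setting acts as a gate on concurrent page requests. A minimal sketch, assuming a SemaphoreSlim-based limiter (illustrative, not the actual TumblThree code):

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ScanConnectionLimiter
{
    private readonly SemaphoreSlim gate;
    private readonly HttpClient client = new HttpClient();

    public ScanConnectionLimiter(int scanConnections)
    {
        gate = new SemaphoreSlim(scanConnections, scanConnections);
    }

    // Every page scan waits for a free slot, so at most scanConnections
    // requests hit the website at the same time.
    public async Task<string> DownloadPageAsync(string url)
    {
        await gate.WaitAsync();
        try { return await client.GetStringAsync(url); }
        finally { gate.Release(); }
    }
}
```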

@willemijns

OK, thanks, I understand now ;)

@johanneszab johanneszab changed the title Parse the regular website instead of using the tumblr api [Feature Request] Parse the regular website instead of using the tumblr api Aug 27, 2017