This repository has been archived by the owner on Mar 9, 2021. It is now read-only.

[Question] Discussion and Questions #112

Open
johanneszab opened this issue Jul 19, 2017 · 53 comments

Comments

@johanneszab
Owner

For non-issue related questions, please ask here instead of creating new issues.

@Taranchuk

Thank you for the thread!
In the settings I'm a bit confused by the several connection settings; I'd like to understand exactly what they do.
The values for parallel blogs and parallel connections set the number of connections to Tumblr. If I set 20 parallel connections and 2 parallel blogs, will there be 10 download streams per blog to the Tumblr servers?
Further below there are settings for scan connections and for the number of connections to the Tumblr api, with values of 4 scan connections and 60 connections to the Tumblr api.
I guess the parallel connections setting determines the number of streams used to download files from the collected links, while the api connections setting concerns how those links are fetched from the Tumblr api. If that's right, what is the scan connections setting, whose value is 4? Is it somehow related to the connections to the Tumblr api? I tried setting it to 1000 and started downloading blogs (specifically the image metadata): I did not see a "Limit Exceeded" error, but I also did not notice any apparent increase in download speed. Is it better to leave this value at 1000, or to return it to 4?
There are also two settings, "Timeout" and "Time interval". Do I understand correctly that the upper one is the maximum duration of a file-download connection and the lower one the maximum duration of a connection to the Tumblr api, after which the program forcibly terminates these connections? Would increasing the time interval for the Tumblr api improve download speed? Sometimes I notice that the program does not download part of the metadata without showing a "Limit Exceeded" error, perhaps because of a timeout of the api connection.

@ghost

ghost commented Jul 29, 2017

Hello.

I was wondering where can I get the .exe file of the latest release. Unfortunately, I don't have VS2015 or higher, but I wanted to test out the app.

@johanneszab
Owner Author

@AnryCryman: Under releases, download the latest release. Currently that is v1.0.8.4, so the right file is TumblThree-v1.0.8.4.zip

@ghost

ghost commented Jul 29, 2017

@johanneszab Yeah, I downloaded it. But there are no executables there, only source code. Can you possibly email me the .exe file of the latest release to anrycryman@gmail.com?

@johanneszab
Owner Author

..

I've uploaded that file myself, and I'm pretty sure there is a file called TumblThree/TumblThree.exe in that particular zip file. I cannot send you the .exe by itself since it needs some additional .dlls, which are included in the zip file. Thus, I'd have to send you the exact same file I've linked above.

Why do you download the file called "source code" if you don't want the source code? Since I've already received five similar emails, there must be a reason. Should I rename the link to "binary"? Did you download the source code .zip file from the main page by pressing the green download-or-clone button?

@ghost

ghost commented Jul 29, 2017

@johanneszab Sorry, my mistake. Must have hit the wrong link. I downloaded TumblThree-v1.0.8.4.zip and found TumblThree.exe in there.
Is there a way to explicitly specify the language of the app?

@johanneszab
Owner Author

@Taranchuk:

The values for parallel blogs and parallel connections set the number of connections to Tumblr. If I set 20 parallel connections and 2 parallel blogs, will there be 10 download streams per blog to the Tumblr servers?

No, there will be 20 streams opened to the Tumblr servers. It was more hard coded in the beginning; right now TumblThree checks the current number of active blogs and gives each active blog its slice of the downloads. Thus, if you have the parallel connections setting set to 20 but only one active blog in the queue, that blog will consume all 20 connections. If you have 2 active blogs, each gets 10 streams. It's probably a bit wonky, but it should work most of the time.
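That division of the connection budget can be sketched like this (an illustrative Python sketch of the behavior described above, not TumblThree's actual C# code; the function name is made up):

```python
def connections_per_blog(parallel_connections: int, active_blogs: int) -> int:
    """Split the global connection budget evenly among active blogs.

    Illustrative only: TumblThree is written in C# and may round or
    rebalance differently.
    """
    if active_blogs <= 0:
        return 0
    return max(1, parallel_connections // active_blogs)

# 20 connections and 1 active blog: the single blog gets all 20.
# 20 connections and 2 active blogs: 10 streams each.
```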

Further below there are settings for scan connections and for the number of connections to the Tumblr api, with values of 4 scan connections and 60 connections to the Tumblr api.

I've decoupled the scan/crawler connections from the settings above at some point. Talking to the Tumblr api/svc service and parsing the website is usually quite quick, since it's only a few KB of text. In the beginning of TumblThree, the crawler ran first and the downloader started only after it finished. Thus it made sense to allow more connections for parsing the website and grabbing the urls than for downloading the heavy binary data.

Right now these values are mostly superfluous, for the following reasons:

  1. The downloader starts as soon as the crawler drops the first image/video/metadata url into the queue, so the waiting time until the first actual download starts is mostly negligible now.
  2. The Tumblr api is rate limited now. This means they only allow a specified number of connections to the api per time period. Thus, even if you increase the scan connections but have the "Limit the scan connections to the Tumblr api" checkbox ticked, the connections are queued until a free slot is available. It basically makes no difference, since the rate limiter is the limiting factor.
  3. If you use the SVC release or the parsing release, however, you can increase or turn off the "Limit the scan connections to the SVC Service" setting. I discovered the svc service during my implementation of the private blog downloader. It outputs even more data about a blog's posts than the Tumblr api, but does not seem to be rate limited. They possibly cannot even limit it, since their webpage depends on it. I've already implemented most features in that branch. You'll have to try it; I don't know if they'll eventually limit it (if abused).

@johanneszab
Owner Author

There are also two settings, "Timeout" and "Time interval". Do I understand correctly that the upper one is the maximum duration of a file-download connection and the lower one the maximum duration of a connection to the Tumblr api, after which the program forcibly terminates these connections?

Exactly.

  • The Timeout (s) value is the maximum time a stream to the server stays open if there is no activity on it. Thus, if no data comes back for 120 seconds, the stream is closed.
  • The Time Interval (s) value belongs to the "Limit connections to the Tumblr Api" setting. If you enable the checkbox, TumblThree allows x connections to the api per y seconds. E.g., the default allows 90 connections per 60 seconds. For me this value finally works without any forcefully closed connections (e.g. Limit Exceeded / 403 error messages). Keep in mind that this limit is global: if you browse the api manually or open TumblThree twice from the same connection, your connections might still be dropped. Thus, if you open TumblThree twice, you'd have to halve the value in each instance (45 connections per 60 seconds).
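The "x connections per y Time Interval" behavior can be sketched as a sliding-window rate limiter (an illustrative Python sketch of the idea described above; TumblThree itself is C# and its implementation may differ):

```python
import collections
import time

class RateLimiter:
    """Allow at most `max_calls` per `interval` seconds (sliding window).

    Minimal sketch of the "limit connections to the api" behavior; the
    class and method names are made up for illustration.
    """

    def __init__(self, max_calls: int, interval: float):
        self.max_calls = max_calls
        self.interval = interval
        self.calls = collections.deque()  # timestamps of recent calls

    def acquire(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.interval:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False  # caller should wait and retry (i.e. the call is queued)

# Default: 90 api connections per 60 seconds.
limiter = RateLimiter(max_calls=90, interval=60.0)
```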

@johanneszab
Owner Author

Also take a look into #107 for some more program details.

@shakeyourbunny

shakeyourbunny commented Aug 8, 2017

Please redesign the whole UI to a sane level that conforms to common expectations:

  • Selecting blogs is slow and does not do the obvious thing; the automatic color changes in the lines are especially confusing. Perhaps add a checkbox for selecting.
  • How do I delete / select / rearrange things in the download list?
  • The whole UI is very sluggish (on an i6770K at 3.4 GHz ...).
  • Replace the color codes in the listing with descriptive text, or add an explanation of what they mean.
  • Add a (scrolling) log tab showing what the program is currently busy with.
  • There is no "update (all) blogs" button (with a "force check" checkbox) on the button bar.
  • Why does it not download the images after extracting the URLs?
  • The progress bar in the main window is very confusing; in particular, it is "green" but not filled. Why?

@Kvothe1970

I wonder: would it be possible to auto-upload files by queueing them? As in: point the app to a folder (or folders), provide a text file with a tag (or make it configurable), then have the app process the folder, uploading and queueing the images according to the setting (one at a time, two, three, four), adding the tag, etc.
This would make TumblThree even more than an amazing backup tool.

@johanneszab
Owner Author

@Kvothe1970:
Nice idea. It should be possible, yes. There already is a file system monitoring api in C#, so implementing this should be more or less straightforward.

Maybe it would also be a good idea to implement a GUI-less TumblThree at the same time and let it be started from the command line. That might reduce resource usage.
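The folder-monitoring idea could look roughly like this (a Python sketch of one polling pass; the real C# implementation would presumably use FileSystemWatcher change events instead of polling, and all names here are made up):

```python
def new_media_files(folder_listing, known, extensions=(".jpg", ".png", ".gif")):
    """Return entries from one folder listing that were not seen before.

    `folder_listing` would come from os.listdir() in a polling loop; a
    real C# implementation would subscribe to FileSystemWatcher events
    instead. Illustrative sketch only.
    """
    fresh = []
    for name in sorted(folder_listing):
        if name.lower().endswith(extensions) and name not in known:
            known.add(name)    # remember it so the next pass skips it
            fresh.append(name)
    return fresh
```

Each batch returned by a pass would then be queued for upload with the configured tag.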

@Kvothe1970

@johanneszab Considering I am a big fan of GUIs I would support this being optional ;)

@Emphasia

Emphasia commented Aug 16, 2017

Can I download my own liked photos and videos?
I tried "liked/by/myaccount", but it just shows: "Request denied. You do not have permission to access this page."

@johanneszab johanneszab changed the title Discussion and Questions [Question] Discussion and Questions Aug 27, 2017
@Taranchuk

Taranchuk commented Aug 28, 2017

  1. What is the *_files.tumblrtagsearch file that lies in the folders created by the tag downloader? I looked inside, and it turns out that filenames are stored there, but there are a few hundred more filenames than there really are files in the folder. Why might that be, and what does it affect? Could it be that there are skipped files that were not downloaded the first time, and the program cannot download them again because they are already on the list, just as with the index files?
  2. Are the search and tag downloaders tied to the Tumblr api? Is it possible to disable the api limit in the settings and run several instances for downloading by tag and search keywords, without the risk of some files being skipped during the download? I have already tried this and have not yet encountered a limit error, but I would not like to find out that limit detection is simply not built into these functions and I just don't notice the missing files.
  3. Is it possible to get some metadata from the search and tag pages? It seems to me that it's impossible to get it completely, but is it possible to get a full list of the blogs the images on these pages were downloaded from? I would very much like to see a function for downloading a list of blogs from search and tag pages, so I could pick the most frequent blogs and add them to the program; they probably contain good content if they contribute a lot of the content I'm interested in on those search and tag pages.

@johanneszab
Owner Author

  1. Only the regular Tumblr blog downloader in the 1.0.8.X releases uses the Tumblr api.
    The search downloader parses the regular website, so you should be able to run multiple instances without any problems.
    The SVC release (1.0.7.X) and the downloader for private blogs in the normal release (1.0.8.X) use a web service that the browser itself requires to display the website. So it might eventually be rate limited, but I don't think so.

@douww2000

Thank you for the great tool!
One question: when I first ran the application, I could see the textboxes for the download time span (from ~ to) in the Details panel, but when I choose a blog, these textboxes disappear. Is this a function in progress, or am I using it the wrong way?

@Taranchuk

douww2000, this is for tag pages only. If you need to download blogs partially, use the page-download function. For example, if you only need the last 1000 posts and you have the default 50 posts per page in the detail views, set the interval in the "Download pages:" field to 1-20. Or to 1-1000, if you set 1 post per page.
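The page arithmetic works out like this (a trivial sketch; `pages_needed` is a made-up helper name):

```python
import math

def pages_needed(posts_wanted: int, posts_per_page: int) -> int:
    """Number of pages to request to cover the newest posts."""
    return math.ceil(posts_wanted / posts_per_page)

# The newest 1000 posts at the default 50 posts per page: enter pages 1-20.
```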

@johanneszab
Owner Author

johanneszab commented Sep 2, 2017

Well, that's not entirely right. I've included it in the release notes (v1.0.8.18), because downloading posts within a defined time span is possible for (private) blog downloads too.

So, I guess you'll have to update to the latest version.

@PonyGirl6763

I can't get into any private blogs. I went to settings and successfully authenticated with my Tumblr login credentials, but none of the private blogs I want to back up will download. I've tried both a friend's private blog and my own, and neither works. Am I missing a step?

@johanneszab
Owner Author

johanneszab commented Sep 4, 2017

Am I missing a step?

Yes: describing exactly what happens when you try to download a private blog.
What do you see in the queue progress? What happens with the blog; does it just finish, or does it hang? What did you select in the Details window for the private blog? Any tags? Maybe also post the url here, so that someone can check whether it actually works.

Since you aren't the first person reporting this (see #118 for more), there might be something missing, but the blog posted there actually worked for me. Thus, I cannot do anything, since I cannot reproduce the error.

Of course, you could also debug the code/error yourself, if there is one after all ..

@johanneszab
Owner Author

Ok, that won't work right now, since you need a password to view your blog.

What I meant with a private blog is a blog like this: https://privtumbl.tumblr.com/ where you need to be logged in in order to see them.

It's probably possible to implement something so that it works with password-protected blogs too, but it's not possible right now.
What are these things called, anyway? It's weird, since they are called differently all over the place. At least the last time I looked.

@johanneszab
Owner Author

johanneszab commented Sep 4, 2017

Hmm, it's way easier than I thought. You just have to do an additional POST request with the password in the body before browsing the blog; that's it. All the other code can be reused, I guess.

Looks easy to do, but I don't have any time for this right now.

tumblr_passwordprotected2
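A sketch of building such a request (Python, for illustration only; the endpoint path and the form-field name "password" are assumptions here, not confirmed details, so check the blog's actual login form in the browser dev tools before relying on them):

```python
from urllib.parse import urlencode

def build_password_request(blog_url: str, password: str):
    """Build the POST that could unlock a password-protected blog.

    Sketch only: the target url (the blog root) and the field name
    "password" are assumptions, not verified against Tumblr.
    """
    url = blog_url.rstrip("/") + "/"          # assumed endpoint
    body = urlencode({"password": password})  # assumed field name
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return url, body, headers
```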

@johanneszab
Owner Author

johanneszab commented Sep 4, 2017

Okay, one thing that works in the meantime: just update the cookie with Internet Explorer. TumblThree uses Internet Explorer to log in to tumblr.com; Internet Explorer is just opened in a different window, so they share the same cookie.

To download your password-protected blog, try the following:

  • Start Internet Explorer, enter your blog's URL, enter the password, and load your blog once.
  • Now you can re-add your blog to TumblThree, and it should work; at least it did in my short test with my test blog here. You have to be authenticated though, even if you have a "public" blog, until I update the code properly.

@johanneszab
Owner Author

johanneszab commented Sep 4, 2017

@PonyGirl6763:
I think this will work for you (it downloads password-protected blogs). You'll have to supply the blog's password in the Details tab:

TumblThree-v1.0.8.41-Application.zip

@johanneszab
Owner Author

So something like a hidden (login required) and password-protected blog does not exist? I can set both options on my second tumblr blog, but then it's impossible to access it from another account.

After I log in with a second account, I always get a 404 page ("this tumblr does not exist") without seeing any password request page at all.

@AriinPHD

AriinPHD commented Feb 26, 2018

Hi @johanneszab, thanks for this question thread (and for a really amazing app)! :)
I have a question about "Downloaded Files" vs. "Number of Downloads". Most of the time these numbers don't match up; why is that?
What do the numbers represent? I first assumed that "Number of Downloads" was the total number of downloads available after filtering (settings), and that "Downloaded Files" was a way to confirm that you had scraped 100% of the available gallery, but seeing the inconsistency makes me realize they don't work like that at all. Can you please explain? :)
tumblthree_downloaded-items_2018-02-26_13-56-21
Screenshot: Both blog crawls are complete.

@johanneszab
Owner Author

It was implemented differently before, but the Number of Downloads is the number of downloads (posts, videos, images, external images/videos) TumblThree detected during the current crawl with your given settings, yep. So it's neither the total number of possible downloads nor the total number of posts of the blog. Previously I tried to calculate the total number, but it was never really consistent.

As I've just mentioned, the number of posts can be lower than the number of downloads if the blog contains a picture set, since TumblThree will download all pictures from that set, or if there is an embedded picture within a post. If someone deletes things from the blog, the Downloaded Files count will be higher than the Number of Downloads. It just was never really right, and people kept complaining, so I changed it to the current behavior.

It should be (almost) complete in your case if you download the whole blog at once, yes. But some urls TumblThree grabs aren't accessible on the Tumblr servers anymore. I've seen a few cases (pictures), and I'm sure those images are the reason for the lower count; they just return a 403 error code. I cannot give you an example right now, though.

So it's more or less a rough estimate.
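The counting behavior described above can be illustrated like this (a Python sketch; the post dictionaries and the field name `media_urls` are hypothetical, not TumblThree's real data model):

```python
def count_detected_downloads(posts):
    """Count downloadable urls the way described above: one per media url.

    A photoset contributes several urls, so this count can exceed the
    post count. Illustrative sketch only.
    """
    return sum(len(post.get("media_urls", [])) for post in posts)

posts = [
    {"media_urls": ["a.jpg"]},                       # single photo post
    {"media_urls": ["b1.jpg", "b2.jpg", "b3.jpg"]},  # photoset
]
# 2 posts, but 4 detected downloads
```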

@AriinPHD

I see, great response @johanneszab thanks! :)

I'll keep that in mind and, I assume, can safely ignore the numbers and use them only as an estimate of amounts/size. :)

Thanks again!

@Hrxn

Hrxn commented May 29, 2018

A short question: How does TumblThree determine its "duplicate found" elements?

I have tried the program with a single blog (with a rather big post count), and the number of found duplicates seems a bit high to me, although that's just a guess, I admit. But given that each post on a blog has its own unique post ID, it can't be the posts themselves, or am I mistaken?

@johanneszab
Owner Author

How does TumblThree determine its "duplicate found" elements?

It simply counts the occurrences of a generic data type that shows up in the download queue. The queue is filled by the website/api crawler tasks and emptied by the downloader task. For photos/videos/audio files, the duplicate check is based on the url; for text posts, on the post id.
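The duplicate check can be sketched as follows (illustrative Python, not TumblThree's actual C# code; the item dictionaries and names are made up):

```python
def dedupe(items):
    """Split crawled items into unique downloads plus a duplicate count.

    Media items are keyed by url and text posts by post id, mirroring
    the description above. Illustrative sketch only.
    """
    seen, unique, duplicates = set(), [], 0
    for item in items:
        key = item.get("url") or item.get("post_id")
        if key in seen:
            duplicates += 1  # counted as a "duplicate found"
        else:
            seen.add(key)
            unique.append(item)
    return unique, duplicates
```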

@Hrxn

Hrxn commented May 31, 2018

@johanneszab Ah, okay. But the url is unique for media files? As opposed to post IDs?

@dsteiger

dsteiger commented Dec 6, 2018

Is it possible to download the items in Drafts?
This URL: https://www.tumblr.com/blog/yourblognamehere/drafts

@tehgarra

tehgarra commented Dec 6, 2018

Is it possible to get an output of the errors at the top of the window, since there's no way to scroll or wrap them around?

Apparently some of the _files.tumblr files disappeared when I updated, and I don't know which ones because of the limited frame where they are displayed. The tooltip shows "Serialization Error".

@johanneszab
Owner Author

@dsteiger: Currently, no. Possible? I don't know; I've never looked into that.

@johanneszab
Owner Author

@tehgarra: does it help to move the mouse cursor over the error? If the information isn't in the tooltip, then no.

@tehgarra

tehgarra commented Dec 6, 2018

@johanneszab It doesn't. I'll just try matching everything, see which ones have the _files.tumblr files missing, and redownload from there; I think that's my best option. For context: I transferred everything from one hard drive to another and hadn't had any problems so far, but then I updated the version and came across that error, that's all.

I'll try deleting the child id for the directory and see if that works.

Imo the only real issue is the entry being deleted from the list of blogs.

@johanneszab
Owner Author

You can do a "blog url" export somewhere in the settings once you've loaded all your blogs again.

With that file open in an editor, you can simply re-add all missing blogs by copying them all to the clipboard (ctrl-a, ctrl-c) and letting the clipboard monitor add the missing ones.
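The same diff can be done programmatically (a Python sketch; `missing_blog_urls` is a made-up helper, since TumblThree itself re-adds blogs via the clipboard monitor):

```python
def missing_blog_urls(exported_text: str, loaded_urls):
    """Diff an exported blog-url list against the currently loaded blogs.

    `exported_text` is the content of the "blog url" export file, one
    url per line. Illustrative sketch of the manual workflow above.
    """
    exported = {line.strip() for line in exported_text.splitlines() if line.strip()}
    return sorted(exported - set(loaded_urls))
```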

@dsteiger

dsteiger commented Dec 6, 2018

@dsteiger: Currently, no. Possible? I don't know; I've never looked into that.

That's a shame, but understandable, and probably too complex to implement soon.
It would have been useful to those migrating by Dec 17, 2018, though; many have more drafts than they can post by that deadline.

Thanks for the quick reply!

@britindc

britindc commented Dec 6, 2018

Sorry for the newb question, but I haven't found an answer after reading the wiki or (briefly) looking through submissions:

Is there a way, after fully downloading a blog, to go back a week later and have TumblThree update it to include posts that have been posted to that blog since then?

Great program by the way; I was happy to "donate a beer" :-)

@tehgarra

tehgarra commented Dec 6, 2018

Sorry for the newb question, but I haven't found an answer after reading the wiki or (briefly) looking through submissions:

Is there a way, after fully downloading a blog, to go back a week later and have TumblThree update it to include posts that have been posted to that blog since then?

Great program by the way; I was happy to "donate a beer" :-)

I usually just put it back in the crawl queue and it starts downloading the new images. I haven't had any issues with it at all so far. Sometimes I'll do a rescan, but imo that isn't necessary unless you moved files and want to redownload them.

@tehgarra

tehgarra commented Dec 8, 2018

@johanneszab Does TumblThree have the ability to delete files? I ask because I downloaded files from a blog and ended up having to redownload the blog, and the file count in the folder decreased noticeably from before I redownloaded. Could that possibly be duplicate removal?

@bepis

bepis commented Dec 11, 2018

In what file are the blog URLs stored? I deleted a blog folder, but the blog remains in the program's list of URLs, and it gives an error when I shift-delete it.

@tehgarra

tehgarra commented Dec 11, 2018 via email

@tommynomad

Hallo Johannes, and thanks for the app and the opportunity to ask questions.

I am using basic settings to download my own blog, but I have managed to grab only a fraction of my posts. I hope you can see what I cannot, and advise me.

t3

@johanneszab
Owner Author

Hi @tommynomad:

Thanks a lot for posting your question in this particular thread! And sorry for the late reply, but Tumblr should have picked a different day for the crap they are pulling, and not just before Christmas ...

Any chance you didn't tick the "download reblogged posts" checkbox further down in your attached screenshot?

tumblthree_nomadicpassions

I know, I should have reversed that logic: it should instead be a "download only original posts" checkbox, so that the default would be to grab everything (#332).

@tommynomad

tommynomad commented Dec 15, 2018 via email

@johanneszab
Owner Author

Well, for me it seems to work. Maybe you can try to re-add this blog. Close TumblThree for this, and remove the corresponding index files in the Blogs\Index\ folder (e.g. nomadicpassions.tumblr
and nomadicpassions_files.tumblr).

@tommynomad

tommynomad commented Dec 16, 2018 via email

@MustangWestern

@johanneszab Probably too late for this, and I saw a similar question above but didn't fully understand it. I set up the app to download only the liked posts (I think), but the numbers are a bit screwy. According to tumblr I have 2692 liked posts, but my numbers in the app currently look like this pic. Should it download properly? Regardless, great app, and thanks for helping everyone above!

tumblethreee

@tehgarra

tehgarra commented Dec 17, 2018 via email

@Taranchuk

It seems that after December 17, TumblThree can still download NSFW images and metadata, although I can no longer see these images in the browser. Probably they did not close access to NSFW blogs through the api.

@tehgarra

tehgarra commented Dec 18, 2018

@johanneszab Is "CheckDirectoryForFiles": false in an individual blog's .tumblr file used? I don't see a reference to it in the program itself, besides the "check for file existence globally across all loaded blogs" setting. I'm not sure if this has to do with closing and reopening during a queue, but I feel like the program is just redownloading and overwriting images from a blog in this one case, and I'm not sure how to make sure the files in the folder are being checked.
