Missing posts / different meta information depending on json/text output #200

Closed
dajotim937 opened this issue Dec 27, 2021 · 5 comments

Comments

@dajotim937
dajotim937 commented Dec 27, 2021

I just noticed this with the blog [redacted] (and I guess it affects some others too).
I compared the two outputs against each other and found that some posts are in the json output but not in the text output, and vice versa.
Here are the json and text output files, plus a third file listing which posts are in json but not in text, and which are in text but not in json.
[redacted]
[redacted]
Diffs.txt
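
A minimal sketch of this kind of comparison, assuming the post IDs have already been extracted from each output file. The function name and the sample IDs are hypothetical, and how the IDs are pulled out of each file is left open because the file layouts are not shown in this thread.

```python
def compare_post_ids(json_ids, text_ids):
    """Return (only_in_json, only_in_text) as two sorted lists."""
    json_set, text_set = set(json_ids), set(text_ids)
    return sorted(json_set - text_set), sorted(text_set - json_set)


# Hypothetical IDs, purely for illustration.
only_json, only_text = compare_post_ids(
    ["101", "102", "103"],
    ["102", "103", "104"],
)
print(only_json)  # ['101']
print(only_text)  # ['104']
```

Using sets keeps the comparison independent of the order in which posts appear in either file.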

Also, the json output needs to be wrapped in [ *output* ] to make the json file valid, and the comma after the last post, right before the ], needs to be removed.

@thomas694
Contributor

Please choose your examples more carefully, no nsfw blogs in posts.

The rest I'll look at later.

@dajotim937
Author

Oh, sorry. I didn't know; I just posted the example where I first noticed it.

@thomas694
Contributor

In short, it's not a bug. The text/json setting only affects the format of the output files, but not how or what items are found/downloaded.

In detail, I downloaded the blog several times at intervals ranging from 15 minutes to several hours and saw what I had already assumed.
The blog details always say it has 2418 posts, but the number of (media) items the API lets you download changed (3140, 3149, 3004, 3033, 3015, 2982, 2978, 2957).
In the middle, one run "found" 100 "new" items and couldn't find 120 that it had before.
Normally a blue default image ("this content has been removed...") should be shown instead of the original media.
But towards the end, more and more of the posts themselves "vanished", e.g.:

ID: 159330284307
Name part: tumblr_ol79aw6jBc1vqhtixo1_540
Post date: "2017-04-08 07:46:56 GMT"

At the moment the numbers aren't going down any more, but it's unclear whether this is the final state.
You can see this behavior especially with blogs that have posts which are years old and flagged/rated nsfw and so on, and where all of a sudden the whole blog history is accessed again by one or more downloaders.

If you do the same tests with blogs that only have posts from the last one or two years, you normally get the same result on every run.

Regarding the json output, it was already raised in #82 and hasn't been changed yet, because the app just appends the new elements instead of reading the whole file to replace the ending bracket and add the new elements. For now, to use it in another app, the trailing comma needs to be removed and the brackets added.
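
As a rough sketch of that workaround on the consuming side, the snippet below strips the trailing comma and adds the brackets before parsing. The function name and the sample field `id` are hypothetical, and it assumes the file is a comma-separated run of JSON objects with nothing else in it.

```python
import json


def load_appended_json(text):
    """Parse output made of comma-separated JSON objects that lacks the
    surrounding brackets and ends with a comma after the last object."""
    body = text.strip()
    if body.endswith(","):
        body = body[:-1]  # drop the comma after the last element
    return json.loads("[" + body + "]")


# Hypothetical sample in the appended format described above.
sample = '{"id": "1"},\n{"id": "2"},\n'
posts = load_appended_json(sample)
print(len(posts))  # 2
```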

@dajotim937
Author

> The blog details always say it has 2418 posts, but the number of (media) items the API lets you download changed (3140, 3149, 3004, 3033, 3015, 2982, 2978, 2957).

Interesting, why is that? In my metadata output I saw a post that doesn't exist in the new output, and the post itself is gone too, but the images from that post still exist.

> Regarding the json output, it was already raised in #82 and hasn't been changed yet, because the app just appends the new elements instead of reading the whole file to replace the ending bracket and add the new elements. For now, to use it in another app, the trailing comma needs to be removed and the brackets added.

Well, the brackets shouldn't be too hard to add: write them before and after the loop (or wherever the crawl happens). As for the comma, you could write it before each new element instead of after it, and skip it if the element is the first one. Or handle the first element outside the loop, or something like that.
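
The suggestion above could look roughly like this. It is only an illustration in Python with a hypothetical function name, not the app's actual code, and it assumes all elements are written in a single pass.

```python
import io
import json


def write_json_array(items, out):
    """Write items as one JSON array, putting the comma before every
    element except the first so that no trailing comma is left over."""
    out.write("[")
    for index, item in enumerate(items):
        if index > 0:
            out.write(",")
        out.write("\n" + json.dumps(item))
    out.write("\n]")


# Hypothetical elements, purely for illustration.
buffer = io.StringIO()
write_json_array([{"id": 1}, {"id": 2}], buffer)
print(buffer.getvalue())
```

This only covers elements written in one run. When elements are appended across separate runs, the closing bracket already in the file still has to be dealt with.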

Okay, thank you for the response. It's up to you whether to close the issue.

thomas694 added a commit that referenced this issue Jan 14, 2022
- Until now, new JSON elements were just appended to the list/file, so the last element in the file ended with a comma and the array brackets were missing entirely.
- Now files are written as a complete JSON structure, and existing files are changed to one the next time an element is written to them.
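
One way to keep a file a complete JSON array while appending, without reading the whole file, is to rewrite only its tail. The sketch below is an illustration in Python and not the code from the referenced commit. The function names are hypothetical, and it assumes the file already holds a bracketed array (or does not exist yet), so converting old files without brackets is not covered.

```python
import json
import os
import tempfile


def _last_non_space(f, end):
    """Return (position, byte) of the last non-whitespace byte before `end`."""
    pos = end
    while pos > 0:
        pos -= 1
        f.seek(pos)
        byte = f.read(1)
        if not byte.isspace():
            return pos, byte
    return -1, b""


def append_to_json_array(path, item):
    """Append one item to a JSON array file by rewriting only the file's tail."""
    encoded = json.dumps(item).encode("utf-8")
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        with open(path, "wb") as f:
            f.write(b"[\n" + encoded + b"\n]")
        return
    with open(path, "r+b") as f:
        size = f.seek(0, os.SEEK_END)
        close_pos, closing = _last_non_space(f, size)
        prev_pos, prev = _last_non_space(f, close_pos)
        if closing != b"]" or prev_pos < 0:
            raise ValueError("file does not end with a JSON array")
        # Cut everything after the last element (or after the opening bracket).
        f.seek(prev_pos + 1)
        f.truncate()
        separator = b"" if prev == b"[" else b","
        f.write(separator + b"\n" + encoded + b"\n]")


# Usage with a temporary file and hypothetical elements.
path = os.path.join(tempfile.mkdtemp(), "posts.json")
append_to_json_array(path, {"id": 1})
append_to_json_array(path, {"id": 2})
with open(path, encoding="utf-8") as f:
    print(json.load(f))  # [{'id': 1}, {'id': 2}]
```

The file stays valid JSON after every call, so another app can read it at any point without stripping commas or adding brackets.
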
@thomas694
Contributor

Brackets have been added and the issue has been closed. You can still comment. Feel free to ask for reopening the issue if needed.
