Transparent extraction of archives #14
I need some input on how transparent extraction should behave when reporting paths to files inside extracted archives, and on whether to handle extraction outside or inside (in-place) of the scanned tree. With transparent archive extraction, we could either extract in place or not.
We could also report either a real or a "virtual" path to an extracted file in the scan results.
And if the extraction is not done in-place, then the paths would always be somewhat virtual and absent from the scanned tree. So this is about whether to extract in place or not, and which path to report in the scan results. @akintayo @chinyeungli @MaJuRG @nakami @rakeshbalusa @sschuberth @yahalom5776 Each of you has been involved with issues related to archive extraction. What would be your take and preference? What could be alternative ways? Thanks for your input!
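To make the path-reporting tradeoff concrete, here is a minimal sketch; the `-extract` suffix is purely an illustrative convention for this example, not a decided format:

```python
import posixpath

def virtual_path(archive_path, member_path, suffix="-extract"):
    """Compose a reporting-only "virtual" path for a file inside an
    archive. The path need not exist on disk when extraction happens
    out-of-tree; it only appears in the scan results."""
    return posixpath.join(archive_path + suffix, member_path)

# A member pkg/module.py inside src/lib.zip would be reported as:
print(virtual_path("src/lib.zip", "pkg/module.py"))
# src/lib.zip-extract/pkg/module.py
```

With in-place extraction the reported path would instead be a real on-disk path inside the scanned tree.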
I believe extracting out-of-tree is the better / safer approach, simply because you don't have to worry about whether the directory you plan to extract to already exists. Also, it gives a cleaner separation between the "primary" source code / files and files coming from the archives. Consequently, regarding the reporting I do like

But speaking of in-place vs. out-of-tree extraction, I wonder whether creating files is necessary at all. Why not simply stream the files from the ZIP and directly pass their contents to the scanning engine? That would probably increase performance by reducing file I/O, and also get rid of the need to delete the temporarily extracted files afterwards.
@sschuberth your idea to stream-read archives is intriguing. It can actually be done not only on zips but also on most archives that are handled by libarchive or in Python code. It would not work, though, for those handled by 7zip, I think. I will need to weigh the benefits vs. the code simplicity.
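The streaming idea can be sketched with the standard-library `zipfile` module. This is a minimal illustration for zips only; formats handled through libarchive or 7zip would need different backends:

```python
import zipfile

def stream_zip_members(location):
    """Yield (path, content bytes) for each file in the ZIP at
    ``location``, without writing anything to disk. The contents can
    then be passed straight to a scanning function."""
    with zipfile.ZipFile(location) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as member:
                yield info.filename, member.read()
```

Reading whole members into memory is fine for typical source files; very large members would want chunked reads instead.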
Some notes from the point of view of our use case:
What I'm using currently is our own wrapper script which extracts and runs
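A wrapper script of that extract-then-scan shape could be sketched like this. The default `scancode` invocation here is an assumption about the scanner command; any scanner CLI can be swapped in:

```python
import shutil
import subprocess
import tempfile

def scan_archive(archive_path, scan_command=("scancode", "--json-pp", "-")):
    """Extract an archive to a throwaway directory, run a scanner
    command on the extracted tree, return its stdout, and clean up."""
    extract_dir = tempfile.mkdtemp(prefix="scan-extract-")
    try:
        # unpack_archive infers the format (zip, tar, ...) from the name
        shutil.unpack_archive(archive_path, extract_dir)
        result = subprocess.run(
            [*scan_command, extract_dir],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    finally:
        # remove the temporarily extracted files afterwards
        shutil.rmtree(extract_dir, ignore_errors=True)
```

This keeps the extraction entirely out-of-tree, at the cost of the results referring to temp-dir paths rather than paths in the scanned tree.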
@akaihola Thank you for the input. These are all valid points!
Hi @pombredanne . I have started working on this. Kindly look at this PR - #544 . Thanks! |
@ashutoshsaboo I have some trouble understanding where you are going with #544. It would make sense to first lay out your approach in prose here. The principle is overall simple.
There are other details of course, such as dealing with paths: both the real path we want to report and the "internal" extracted path where the file lives for scanners to process would need to be returned by the iterator. Eventually, creating a small File object may be a clean design. And maybe some extra file info-level data is needed to mark that a file is extracted and not a plain file.
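An iterator along those lines could be sketched as follows. `ScannedFile` and every name here are hypothetical, for illustration only, not existing ScanCode APIs:

```python
import os
from collections import namedtuple

# Small value object pairing the path to report in scan results with the
# on-disk location a scanner should actually read; is_extracted is the
# extra info-level flag marking files that come from inside an archive.
ScannedFile = namedtuple("ScannedFile", "real_path location is_extracted")

def iter_files(base_dir, extracted_archives):
    """Yield ScannedFile objects for the plain files under base_dir and
    for the members of each archive in ``extracted_archives``, a mapping
    of archive path (relative to base_dir) -> extraction directory."""
    for top, _dirs, files in os.walk(base_dir):
        for name in files:
            loc = os.path.join(top, name)
            yield ScannedFile(os.path.relpath(loc, base_dir), loc, False)
    for archive_path, extract_dir in extracted_archives.items():
        for top, _dirs, files in os.walk(extract_dir):
            for name in files:
                loc = os.path.join(top, name)
                member = os.path.relpath(loc, extract_dir)
                # report the member under the archive's own path
                yield ScannedFile(archive_path + "/" + member, loc, True)
```

Scanners read from `location`; only `real_path` ever appears in the results.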
@pombredanne Hi, I have replied to this in the last comment of my PR thread, #544. It would be nice to have your input on it. 😄
* Add Codebase abstraction as an in-memory tree of Resource objects
* Codebase and Resources can be walked, queried, added and removed as needed, top-down and bottom-up, with sorted children.
* The root Resource can now have scans and info for #543
* Codebase Resources have correct counts of children for #607 and #598
* Files can also have children (this is in preparation for transparent archive extraction/walking for #14)
* Initial inventory collection is based on walking the file system once. All other accesses are through the Codebase object.
* Resources hold a scans mapping and have file info directly attached as attributes.
* To support simple serialization of Resources, they do not hold references to their parent and children: instead they hold numeric ids, including a Codebase id that can be accessed through a global cache, which is a poor man's weak-references implementation.
* Remove and fold caching into resource.py at the Resource level. Each resource can put_scans and get_scans, either using the on-disk cache or attached in memory to the resource object.
* Add minimal resource cache tests

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
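The numeric-id design described in that change could look roughly like this simplified sketch (not the actual resource.py code; names are illustrative):

```python
import itertools

# Global registry standing in for the "poor man's weak references":
# Resources refer to each other by numeric id through this cache instead
# of holding direct object references, which keeps them trivially
# serializable.
_CODEBASES = {}
_codebase_ids = itertools.count()

class Codebase:
    def __init__(self):
        self.cid = next(_codebase_ids)
        self.resources = {}
        _CODEBASES[self.cid] = self

class Resource:
    def __init__(self, codebase, name, parent_id=None):
        self.cid = codebase.cid          # id of the owning Codebase
        self.rid = len(codebase.resources)  # this Resource's numeric id
        self.name = name
        self.parent_id = parent_id
        self.children_ids = []
        codebase.resources[self.rid] = self
        if parent_id is not None:
            codebase.resources[parent_id].children_ids.append(self.rid)

    def children(self):
        # resolve ids back to objects through the global cache
        cb = _CODEBASES[self.cid]
        return [cb.resources[i] for i in self.children_ids]
```

Serializing a Resource then only involves plain ints and strings; the object graph is rebuilt through the cache on access.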
Since #885 this will be possible through a pre-scan plugin that dives into archives. |
So after a long time, I think that what we really want is not exactly transparent extraction of archives, but rather smart, selective extraction of archives where relevant in the context of specific archives and scans.
Merge changes from develop to main
Add support for gems and improve RPM support
As noted in #3, we do not extract and scan at the same time.
A better way would be to handle internally an archive as if it were a special type of directory (both contain files after all), and when a single archive scan is requested (or when archives are found in a larger scan) we could extract these temporarily to a temp directory, scan the extract and return the results. This would require a bit more thinking to get it right.
At a high level, a tree with archives would be considered the same as a tree with directories. Archives would become just a special type of directory-like container for more files.
We could expose an `os.walk`-like function that would transparently extract archives to a temp directory and yield a real path and the temp location of a given file.
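Such an `os.walk`-like function could be sketched as follows. This is a minimal illustration assuming `shutil.unpack_archive` handles the formats of interest; nested archives and temp-dir cleanup are deliberately ignored:

```python
import os
import shutil
import tempfile

ARCHIVE_EXTENSIONS = (".zip", ".tar", ".tar.gz", ".tgz")

def walk_with_archives(base_dir):
    """Yield (reported_path, on_disk_location) pairs for every file
    under base_dir. Archives are extracted to a temp directory and
    their members are reported under the archive's own path, as if
    the archive were a directory."""
    for top, _dirs, files in os.walk(base_dir):
        for name in files:
            loc = os.path.join(top, name)
            rel = os.path.relpath(loc, base_dir)
            if name.endswith(ARCHIVE_EXTENSIONS):
                tmp = tempfile.mkdtemp(prefix="extract-")
                shutil.unpack_archive(loc, tmp)
                for itop, _idirs, inames in os.walk(tmp):
                    for iname in inames:
                        iloc = os.path.join(itop, iname)
                        member = os.path.relpath(iloc, tmp)
                        yield rel + "/" + member, iloc
            else:
                yield rel, loc
```

Callers report the first element of each pair and hand the second to the scanners, so the temp location never leaks into results.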