

Changed the get_ext file extension returning logic. #8

Open · wants to merge 2 commits into master
Conversation


Ayoob7 commented Mar 9, 2019

Changed the get_ext file extension returning logic.

I'm thinking of working on file downloading next. Do you think it would be a good idea to use a nifty Python package called PySpark, in case we need to apply Big Data approaches (using Apache Spark) to this server (a server that can run across the shared resources of many computers) for large-scale distributed processing?

Apache Spark is one of the best engines for data-heavy workloads and transformations in distributed computing.

Ayoob7 closed this Mar 9, 2019

Ayoob7 commented Mar 10, 2019

Test cases: passed
uuid.txt.tar1.gz => txt.gz
uuid.txt.tar.gz => txt.tar.gz
uuid.txt1.tar1.gz1 => gz1


Ayoob7 reopened this Mar 10, 2019

yunhailuo commented Mar 10, 2019

My original assumption was that GDC's filenames vary a lot but follow one rule: all metadata is on the "left" and all extensions are on the "right", i.e. metadata comes before extensions. For example, in "THETA_p_TCGA_Batch14_SNP_N_GenomeWideSNP_6_C03_455318.nocnv_grch38.seg.v2.txt", everything up through "v2" is metadata and "txt" is an extension. Meanwhile, since this function is only used for renaming files, it doesn't matter that much whether the script recognizes an extension or not; what matters more is that no extension is lost during renaming, even when the script doesn't recognize some extension.

So for your first test case, "uuid.txt.tar1.gz": since txt is an extension rather than metadata, "tar1.gz" must also be extensions under the assumption above. So it should be uuid.txt.tar1.gz => txt.tar1.gz, and the file would be renamed rename.txt.tar1.gz. This is what I mean when I say that even though the script doesn't recognize tar1 as an extension, it still preserves it, treating tar1 as an unknown extension.
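A minimal sketch of what this rule could look like in code (the function name `get_ext` is from the PR; the `KNOWN_EXTS` set and the exact implementation are assumptions for illustration, not the repo's actual code): everything from the first recognized extension onward is treated as extension, so unrecognized tokens like "tar1" sitting to the right of a known extension are preserved.

```python
# Hypothetical sketch of the "metadata left, extensions right" rule.
# KNOWN_EXTS is an assumed set; the real script may recognize a different list.
KNOWN_EXTS = {"txt", "tar", "gz", "tsv", "xml"}

def get_ext(filename):
    """Return everything from the first recognized extension onward.

    Unrecognized tokens to the right of a known extension (e.g. "tar1")
    are preserved as unknown extensions.
    """
    parts = filename.split(".")
    # parts[0] is always metadata, so start scanning from the second token.
    for i, part in enumerate(parts[1:], start=1):
        if part.lower() in KNOWN_EXTS:
            return ".".join(parts[i:])
    # No recognized extension: fall back to the last token.
    return parts[-1] if len(parts) > 1 else ""
```

Under this sketch, `get_ext("uuid.txt.tar1.gz")` returns `"txt.tar1.gz"` (matching the behavior described above), while `get_ext("uuid.txt1.tar1.gz1")`, where nothing is recognized, falls back to `"gz1"`.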

The reason I said it is not a bug in #1 is that I didn't see the code break as long as my old assumption held. That said, things could have changed and my old assumption may no longer be true. I suggest you study the filenames on GDC a bit, find some use cases, and figure out how the original code breaks for those cases.

I'm not convinced about introducing PySpark. I'd raise the same question: what would the use cases be here? The scripts in this repo are for data transformation and loading for Xena. In my opinion, the data transformation here is very basic, and I haven't yet seen a need for an advanced Big Data tool. This repo is also not for building any servers, now or in the foreseeable future. I do respect your suggestions, so if you can list some potential use cases that may need or benefit from PySpark, that would be great. Thank you.
