

Changed the get_ext file extension returning logic. #8

Open · wants to merge 2 commits into master
Conversation


Ayoob7 commented Mar 9, 2019

Changed the get_ext file extension returning logic.

I'm thinking of working on file downloading next. Do you think it would be a good idea to use a nifty Python package called PySpark, in case we need to apply Big Data approaches (using Apache Spark) to this server (a server that can run across the shared resources of many computers) for large-scale distributed processing?

Apache Spark is one of the best engines for data-heavy workloads and transformations in distributed computing.

Ayoob7 closed this Mar 9, 2019

Ayoob7 commented Mar 10, 2019

Test cases: passed
uuid.txt.tar1.gz => txt.gz
uuid.txt.tar.gz => txt.tar.gz
uuid.txt1.tar1.gz1 => gz1


Ayoob7 reopened this Mar 10, 2019

yunhailuo commented Mar 10, 2019

My original assumption was that GDC's filenames vary a lot but follow one rule: all metadata is on the "left" and all extensions are on the "right", i.e. metadata comes before extensions. For example, in "THETA_p_TCGA_Batch14_SNP_N_GenomeWideSNP_6_C03_455318.nocnv_grch38.seg.v2.txt", everything up through "v2" is metadata and "txt" is an extension. Meanwhile, since this function is only used for renaming files, it doesn't matter that much whether the script recognizes an extension or not; what matters more is that no extension is lost during renaming, even when the script doesn't recognize some extension.

So for your first test case, "uuid.txt.tar1.gz": since txt is an extension rather than metadata, "tar1.gz" must also be extensions under the assumption above. So it should be uuid.txt.tar1.gz => txt.tar1.gz, and the file would be renamed rename.txt.tar1.gz. This is what I mean when I say that even though the script doesn't recognize tar1 as an extension, it still preserves it, treating tar1 as an unknown extension.
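A minimal sketch of what this rule could look like in code (the function name `get_ext` is from the PR; the `KNOWN_EXTS` set and the exact implementation are assumptions for illustration, not the repo's actual code): everything from the first recognized extension onward is treated as extension, so unrecognized tokens like "tar1" sitting to the right of a known extension are preserved.

```python
# Hypothetical sketch of the "metadata left, extensions right" rule.
# KNOWN_EXTS is an assumed set; the real script may recognize a different list.
KNOWN_EXTS = {"txt", "tar", "gz", "tsv", "xml"}

def get_ext(filename):
    """Return everything from the first recognized extension onward.

    Unrecognized tokens to the right of a known extension (e.g. "tar1")
    are preserved as unknown extensions.
    """
    parts = filename.split(".")
    # parts[0] is always metadata, so start scanning from the second token.
    for i, part in enumerate(parts[1:], start=1):
        if part.lower() in KNOWN_EXTS:
            return ".".join(parts[i:])
    # No recognized extension: fall back to the last token.
    return parts[-1] if len(parts) > 1 else ""
```

Under this sketch, `get_ext("uuid.txt.tar1.gz")` returns `"txt.tar1.gz"` (matching the behavior described above), while `get_ext("uuid.txt1.tar1.gz1")`, where nothing is recognized, falls back to `"gz1"`.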

The reason I said it is not a bug in #1 is that I didn't see the code break as long as my old assumption held. That said, things could have changed and my old assumption may no longer be true. I suggest you study the filenames on GDC a bit, find some use cases, and figure out how the original code breaks for those cases.

I'm not convinced about introducing PySpark. I'd raise the same question: what would the use cases be here? The scripts in this repo are for data transformation and loading for Xena. In my opinion, the data transformation here is very basic, and I haven't yet seen a need for an advanced Big Data tool. This repo is also not for building any servers, now or in the foreseeable future. I do respect your suggestions, so if you can list some potential use cases that may need or benefit from PySpark, that would be great. Thank you.
