CSV printing speedup (#570)
* Closes #541

* Update GenericCodeInterface.py

* fixed

* fixed tester (#528)

* ensemble model pb weights for variables coming from functions

* fixed single-value-duplication error for SKL ROMs (#555)

* fixed single-value-duplication error

* xsd

* fixed type

* fixed test framework/ensembleModelTests.testEnsembleModelWith2CodesAndAliasAndOptionalOutputs

* modified order of input output to avoid regolding

* ok

* Reducing DataObject Attribute Functionality (#278)

* Enabling the data attribute tests and fixing the operators for PointSets. TODO: Break the data_attributes test down to be more granular and fix the outputPivotValue on the HistorySets.

* Splitting the test files for the DataObject attributes and correcting some malformations in the subsequent input files. TODO: Fix the attributes for the history set when operating from a Model.

* Fixing HistorySet data attribute test case to look for the correct file.

* Correcting attributions for data object tests. maljdan had only moved the files. The original tests were designed by others. TODO: verify if test results are valid or the result of incorrect gold files.

* Reducing the number of DataObjects needed in the shared suite of DataObject attribute tests.

* Regolding the DataObject HistorySet attributes files to respect the outputPivotVal specified for stories2.

* Picking up where I left off, trying to recall what modifications still need to be done to the HistorySet.

* Regolding a test case on data attributes, removing dead code from the HistorySet and updating some aspects of the PointSet.

* Removing data attribute feature set with explanation in comments. Cleaning old code.

* Regolding fixed test case.

* Reverting changes to ensemble test and accommodating unstructured inputs.

* addressed misunderstanding in HistorySet

* added HSToPSOperator PP

* added documentation for new interface

* finished new PP

* addressed first comments

* addressed Congjian's comments

* updated XSD

* moving ahead

* fixed test framework/ensembleModelTests.testEnsembleModelLinearThreadWithTimeSeries

* last one almost done

* fixed framework/ensembleModelTests.testEnsembleModelLinearParallelWithOptimizer

* fixed framework/CodeInterfaceTests.DymolaTestTimeDepNoExecutableEnsembleModel

* almost done

* fixed framework/PostProcessors/InterfacedPostProcessor.metadataUsageInInterfacePP

* fixed new test files coming from devel

* updated InterfacedPP HStoPSOperator

* fixed xsd

* added documentation for DataSet

* added conversion script from old HDF5 to new HDF5

* Update DataObjects.xsd

* remove white space

* Update database_data.tex

* testing printing

* reverted to_csv for ND dataset.  Need a good test for multiple-index dataset printing.

* added benchmark results for numpy case
PaulTalbot-INL authored and alfoa committed Feb 9, 2018
1 parent 3067d8b commit ab03f6e
Showing 2 changed files with 90 additions and 39 deletions.
60 changes: 30 additions & 30 deletions doc/user_manual/existing_interfaces.tex
@@ -134,9 +134,9 @@ \subsection{RAVEN Interface}
\label{subsec:RAVENInterface}
The RAVEN interface provides the ability to execute a RAVEN input file
driving a set of SLAVE RAVEN calculations. For example, if the user wants to optimize the parameters
of a surrogate model (e.g. minimizing the distance between the surrogate predictions and the real data),
this can be achieved by setting up a RAVEN input file (master) that performs an optimization on the feature
space characterized by the surrogate model parameters, whose training and validation assessment is performed in the SLAVE
RAVEN runs.
\\ There are some limitations for this interface:
\begin{itemize}
@@ -150,7 +150,7 @@ \subsection{RAVEN Interface}
\\ Similarly to any other code interface, the user provides paths to executables and aliases for sampled variables within the
\xmlNode{Models} block. The \xmlNode{Code} block will contain attributes \xmlAttr{name} and
\xmlAttr{subType}. \xmlAttr{name} identifies that particular \xmlNode{Code} model within RAVEN, and
\xmlAttr{subType} specifies which code interface the model will use (in this case, \xmlAttr{subType}=``RAVEN'').
The \xmlNode{executable}
block should contain the absolute or relative (with respect to the current working
directory) path to the RAVEN framework script (\textbf{raven\_framework}).
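A minimal sketch of such a \xmlNode{Code} block (the model name and executable path here are illustrative assumptions, not taken from a working input):
\begin{lstlisting}[style=XML]
<Models>
  <Code name="outerRaven" subType="RAVEN">
    <executable>path/to/raven_framework</executable>
  </Code>
</Models>
\end{lstlisting}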
@@ -167,45 +167,45 @@ \subsection{RAVEN Interface}
\begin{lstlisting}[language=python]
def manipulateScalarSampledVariables(sampledVariables):
  """
    This method is used to manipulate scalar variables.
    The user can create new variables based on the
    variables sampled by RAVEN.
    @ In, sampledVariables, dict, dictionary of
      sampled variables ({"var1":value1,"var2":value2})
    @ Out, None, the new variables should be
      added to the "sampledVariables" dictionary
  """
  newVariableValue = (
      sampledVariables['Distributions|Uniform@name:a_dist|lowerBound']
      + 1.0)
  sampledVariables['Distributions|Uniform@name:a_dist|upperBound'] = \
      newVariableValue
  return
\end{lstlisting}

\item \textbf{\textit{convertNotScalarSampledVariables}}, a method aimed at converting non-scalar variables (e.g. 1D arrays) into multiple scalar variables
(e.g. \xmlNode{constant}(s) in a sampling strategy).
This method is required whenever non-scalar variables are detected by the interface.
Example:
\begin{lstlisting}[language=python]
def convertNotScalarSampledVariables(noScalarVariables):
  """
    This method is used to convert non-scalar
    variables into multiple scalar variables. The user MUST
    create new variables based on the non-scalar variables
    sampled (and passed in) by RAVEN.
    @ In, noScalarVariables, dict, dictionary of sampled
      variables that are not scalar ({"var1":1Darray1,"var2":1Darray2})
    @ Out, newVars, dict, the new variables that have
      been created based on the non-scalar variables
      contained in the "noScalarVariables" dictionary
  """
  oneDimensionalArray = noScalarVariables['temperatureHistory']
  newVars = {}
  for cnt, value in enumerate(oneDimensionalArray):
    newVars['Samplers|MonteCarlo@name:myMC|constant'
            '@name=temperatureHistory'+str(cnt)] = value
  return newVars
\end{lstlisting}
@@ -225,7 +225,7 @@ \subsection{RAVEN Interface}
</Code>
\end{lstlisting}

Like for every other interface, the syntax of the variable names is important, since it tells the parser how to perturb an input file.
\\ For the RAVEN interface, a syntax inspired by the XPath nomenclature is used.
\begin{lstlisting}[style=XML]
<Samplers>
@@ -251,7 +251,7 @@ \subsection{RAVEN Interface}
\begin{lstlisting}[style=XML]
<Models>
<ROM name="ROM1" subType="SciKitLearn">
...
<C>10.0</C>
...
</ROM>
@@ -261,7 +261,7 @@ \subsection{RAVEN Interface}
\begin{lstlisting}[style=XML]
<Models>
<ROM name="ROM1" subType="SciKitLearn">
...
<tol>0.0001</tol>
...
</ROM>
@@ -275,9 +275,9 @@ \subsection{RAVEN Interface}
<variable name="var1">
...
<grid construction="equal" type="value" steps="1">0 1</grid>
...
</variable>

...
</MonteCarlo>
</Samplers>
@@ -1303,20 +1303,20 @@ \subsubsection{Models}
...
</Simulation>
\end{lstlisting}
RAVEN works best with Comma-Separated Value (CSV) files. Therefore, the default
.mat output type needs to be converted to .csv output.
The Dymola interface will automatically convert the .mat output to a human-readable
form, i.e., .csv output, through its implementation of the finalizeCodeOutput function.
\\In order to speed up the reading and conversion of the .mat file, the user can specify
the list of variables (in addition to the Time variable) that need to be imported and
converted into a csv file, minimizing
the IO memory usage as much as possible. Within the \xmlNode{Code} block, the following XML
node (in addition to the \xmlNode{executable} one) can be specified:

\begin{itemize}
  \item \xmlNode{outputVariablesToLoad}, \xmlDesc{space separated list, optional
    parameter}, a space-separated list of variables to be exported from the .mat
    file (in addition to the Time variable). \default{all the variables in the .mat file}.
\end{itemize}
For example:
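A minimal sketch (the code name, executable path, and variable names below are illustrative assumptions, not taken from a working input):
\begin{lstlisting}[style=XML]
<Code name="MyDymolaCode" subType="Dymola">
  <executable>path/to/dymosim</executable>
  <outputVariablesToLoad>x1 x2 T</outputVariablesToLoad>
</Code>
\end{lstlisting}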
69 changes: 60 additions & 9 deletions framework/DataObjects/XDataSet.py
@@ -1711,16 +1711,22 @@ def _usePandasWriteCSV(self,fileName,data,ordered,keepSampleTag=False,keepIndex=
if not localIndex:
data.to_csv(fileName+'.csv',mode=mode,header=header, index=localIndex)
else:
data.to_csv(fileName+'.csv',mode=mode,header=header)
## START garbled index fix ##
## At one point we were seeing "garbled" indexes printed from Pandas: a,b,(RAVEN_sample_ID,),c
## Here, commented out, is a workaround that @alfoa set up to prevent that problem.
## However, it is painfully slow, so if garbled data shows up again, we can
## revisit this fix.
## When using this fix, comment out the data.to_csv line above.
#dataString = data.to_string()
# find headers
#splitted = [",".join(elm.split())+"\n" for elm in data.to_string().split("\n")]
#header, stringData = splitted[0:2], splitted[2:]
#header.reverse()
#toPrint = [",".join(header).replace("\n","")+"\n"]+stringData
#with open(fileName+'.csv', mode='w+') as fileObject:
# fileObject.writelines(toPrint)
## END garbled index fix ##
# if keepIndex, then print as is
elif keepIndex:
data.to_csv(fileName+'.csv',mode=mode,header=header)
@@ -1729,6 +1735,51 @@ def _usePandasWriteCSV(self,fileName,data,ordered,keepSampleTag=False,keepIndex=
data.to_csv(fileName+'.csv',index=False,mode=mode,header=header)
#raw_input('Just wrote to CSV "{}.csv", press enter to continue ...'.format(fileName))

# _useNumpyWriteCSV (below) is a secondary method to write out POINT SET CSVs. When benchmarking it against Pandas, I tested using
# different numbers of variables (M=5,25,100) and different numbers of realizations (R=4,100,1000).
# For each test, I did a unit check just on _usePandasWriteCSV versus _useNumpyWriteCSV, and took the average time
# to run a trial over 1000 trials (in seconds). The results are as follows:
# R M pandas numpy ratio per float p per float n per float ratio
# 4 5 0.001748 0.001004 1.741035857 0.00008740 0.00005020 1.741035857
# 4 25 0.002855 0.001378 2.071843251 0.00002855 0.00001378 2.071843251
# 4 100 0.007006 0.002633 2.660843145 0.00001752 6.5825E-06 2.660843145
# 100 5 0.001982 0.001819 1.089609676 0.00000396 0.00000364 1.089609676
# 100 25 0.003922 0.003898 1.006182658 1.5688E-06 1.5592E-06 1.006182658
# 100 100 0.011124 0.011386 0.976989285 1.1124E-06 1.1386E-06 0.976989285
# 1000 5 0.004108 0.008688 0.472859116 8.2164E-07 1.7376E-06 0.472859116
# 1000 25 0.013367 0.027660 0.483261027 5.3468E-07 1.1064E-06 0.483261027
# 1000 100 0.048791 0.095213 0.512442602 4.8791E-07 9.5213E-07 0.512442602
# The per-float columns divide the time taken by (R*M) to give a fair comparison. The summary of the # var versus # realizations per float is:
# ---------- R ----------------
# M 4 100 1000
# 5 1.741035857 1.089609676 0.472859116
# 25 2.071843251 1.006182658 0.483261027
# 100 2.660843145 0.976989285 0.512442602
# When the value is > 1, numpy is better (so when < 1, pandas is better). It seems that "R" is a better
# indicator of which method is better, and R < 100 is a fairly simple case that is pretty fast anyway,
# so for now we just keep everything using Pandas. - talbpaul and alfoa, January 2018
#
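# As an aside, a minimal, hypothetical re-creation of the timing harness summarized
# above (the original benchmark script is not part of this commit; names and file
# paths are illustrative) could look like:
#
#   import timeit
#   import numpy as np
#   import pandas as pd
#
#   def comparePandasNumpy(R, M, trials=1000):
#     data = np.random.rand(R, M)
#     frame = pd.DataFrame(data, columns=['var%d' % i for i in range(M)])
#     # average seconds per trial for pandas.DataFrame.to_csv
#     tPandas = timeit.timeit(lambda: frame.to_csv('pandas.csv', index=False),
#                             number=trials) / trials
#     # average seconds per trial for numpy.savetxt
#     tNumpy = timeit.timeit(lambda: np.savetxt('numpy.csv', data, delimiter=',',
#                                               header=','.join(frame.columns)),
#                            number=trials) / trials
#     # ratio > 1 means numpy was faster for this (R, M) combination
#     return tPandas, tNumpy, tPandas / tNumpy
#
#   for R in (4, 100, 1000):
#     for M in (5, 25, 100):
#       print(R, M, comparePandasNumpy(R, M))
#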
#def _useNumpyWriteCSV(self,fileName,data,ordered,keepSampleTag=False,keepIndex=False,mode='w'):
# # TODO docstrings
# # TODO assert point set -> does not work right for ND (use Pandas)
# # TODO the "mode" should be changed for python 3: mode has to be 'ba' if appending, not 'a' when using numpy.savetxt
# with open(fileName+'.csv',mode) as outFile:
# if mode == 'w':
# #write header
# header = ','.join(ordered)
# else:
# header = ''
# #print('DEBUGG data:',data[ordered])
# data = data[ordered].to_array()
# if not keepSampleTag:
# data = data.drop(self.sampleTag)
# data = data.values.transpose()
# # set up formatting for types
# # TODO potentially slow loop
# types = list('%.18e' if self._getCompatibleType(data[0][i]) == float else '%s' for i in range(len(ordered)))
# np.savetxt(outFile,data,header=header,fmt=types)
# # format data?


### HIERARCHICAL STUFF ###
def _constructHierPaths(self):
