CSV printing speedup (#570)
* Closes #541

* Update GenericCodeInterface.py

* fixed

* fixed tester (#528)

* ensemble model pb weights for variables coming from functions

* fixed single-value-duplication error for SKL ROMs (#555)

* fixed single-value-duplication error

* xsd

* fixed type

* fixed test framework/ensembleModelTests.testEnsembleModelWith2CodesAndAliasAndOptionalOutputs

* modified order of input output to avoid regolding

* ok

* Reducing DataObject Attribute Functionality (#278)

* Enabling the data attribute tests and fixing the operators for PointSets. TODO: Break the data_attributes test down to be more granular and fix the outputPivotValue on the HistorySets.

* Splitting the test files for the DataObject attributes and correcting some malformations in the subsequent input files. TODO: Fix the attributes for the history set when operating from a Model.

* Fixing HistorySet data attribute test case to look for the correct file.

* Correcting attributions for data object tests. maljdan had only moved the files. The original tests were designed by others. TODO: verify if test results are valid or the result of incorrect gold files.

* Reducing the number of DataObjects needed in the shared suite of DataObject attribute tests.

* Regolding the DataObject HistorySet attributes files to respect the outputPivotVal specified for stories2.

* Picking up where I left off, trying to recall what modifications still need to be done to the HistorySet.

* Regolding a test case on data attributes, removing dead code from the HistorySet and updating some aspects of the PointSet.

* Removing data attribute feature set with explanation in comments. Cleaning old code.

* Regolding fixed test case.

* Reverting changes to ensemble test and accommodating unstructured inputs.

* addressed misunderstanding in HistorySet

* added HSToPSOperator PP

* added documentation for new interface

* finished new PP

* addressed first comments

* addressed Congjian's comments

* updated XSD

* moving ahead

* fixed test framework/ensembleModelTests.testEnsembleModelLinearThreadWithTimeSeries

* last one almost done

* fixed framework/ensembleModelTests.testEnsembleModelLinearParallelWithOptimizer

* fixed framework/CodeInterfaceTests.DymolaTestTimeDepNoExecutableEnsembleModel

* almost done

* fixed framework/PostProcessors/InterfacedPostProcessor.metadataUsageInInterfacePP

* fixed new test files coming from devel

* updated InterfacedPP HStoPSOperator

* fixed xsd

* added documentation for DataSet

* added conversion script from old HDF5 to new HDF5

* Update DataObjects.xsd

* remove white space

* Update database_data.tex

* testing printing

* reverted to_csv for ND dataset.  Need a good test for multiple-index dataset printing.

* added benchmark results for numpy case
PaulTalbot-INL authored and alfoa committed Feb 9, 2018
1 parent 3067d8b commit ab03f6e
Showing 2 changed files with 90 additions and 39 deletions.
60 changes: 30 additions & 30 deletions doc/user_manual/existing_interfaces.tex
@@ -134,9 +134,9 @@ \subsection{RAVEN Interface}
\label{subsec:RAVENInterface}
The RAVEN interface provides the ability to execute a RAVEN input file
driving a set of SLAVE RAVEN calculations. For example, if the user wants to optimize the parameters
of a surrogate model (e.g. minimizing the distance between the surrogate predictions and the real data),
this can be achieved by setting up a RAVEN input file (master) that performs an optimization on the feature
space characterized by the surrogate model parameters, whose training and validation assessment is performed in the SLAVE
RAVEN runs.
\\ There are some limitations for this interface:
\begin{itemize}
@@ -150,7 +150,7 @@ \subsection{RAVEN Interface}
\\ Similarly to any other code interface, the user provides paths to executables and aliases for sampled variables within the
\xmlNode{Models} block. The \xmlNode{Code} block will contain attributes \xmlAttr{name} and
\xmlAttr{subType}. \xmlAttr{name} identifies that particular \xmlNode{Code} model within RAVEN, and
\xmlAttr{subType} specifies which code interface the model will use (in this case, \xmlAttr{subType}=``RAVEN'').
The \xmlNode{executable}
block should contain the absolute or relative (with respect to the current working
directory) path to the RAVEN framework script (\textbf{raven\_framework}).
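A minimal sketch of such a \xmlNode{Code} block (the model name and executable path here are illustrative assumptions, not taken from a working input):
\begin{lstlisting}[style=XML]
<Models>
  <Code name="outerRaven" subType="RAVEN">
    <executable>path/to/raven_framework</executable>
  </Code>
</Models>
\end{lstlisting}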
@@ -167,45 +167,45 @@ \subsection{RAVEN Interface}
\begin{lstlisting}[language=python]
def manipulateScalarSampledVariables(sampledVariables):
  """
    This method is used to manipulate scalar variables.
    The user can create new variables based on the
    variables sampled by RAVEN.
    @ In, sampledVariables, dict, dictionary of
      sampled variables ({"var1":value1,"var2":value2})
    @ Out, None, the new variables should be
      added to the "sampledVariables" dictionary
  """
  newVariableValue = (
      sampledVariables['Distributions|Uniform@name:a_dist|lowerBound']
      + 1.0)
  sampledVariables['Distributions|Uniform@name:a_dist|upperBound'] = \
      newVariableValue
  return
\end{lstlisting}

\item \textbf{\textit{convertNotScalarSampledVariables}}, a method aimed at converting non-scalar variables (e.g. 1D arrays) into multiple scalar variables
(e.g. \xmlNode{constant}(s) in a sampling strategy).
This method is required whenever non-scalar variables are detected by the interface.
Example:
\begin{lstlisting}[language=python]
def convertNotScalarSampledVariables(noScalarVariables):
  """
    This method is used to convert non-scalar
    variables into multiple scalar variables. The user MUST
    create new variables based on the non-scalar variables
    sampled (and passed in) by RAVEN.
    @ In, noScalarVariables, dict, dictionary of sampled
      variables that are not scalar ({"var1":1Darray1,"var2":1Darray2})
    @ Out, newVars, dict, the new variables that have
      been created based on the non-scalar variables
      contained in the "noScalarVariables" dictionary
  """
  oneDimensionalArray = noScalarVariables['temperatureHistory']
  newVars = {}
  for cnt, value in enumerate(oneDimensionalArray):
    newVars['Samplers|MonteCarlo@name:myMC|constant'
            '@name=temperatureHistory'+str(cnt)] = value
  return newVars
\end{lstlisting}
@@ -225,7 +225,7 @@ \subsection{RAVEN Interface}
</Code>
\end{lstlisting}

Like for every other interface, the syntax of the variable names is important, since it tells the parser how to perturb an input file.
\\ For the RAVEN interface, a syntax inspired by the XPath nomenclature is used.
\begin{lstlisting}[style=XML]
<Samplers>
@@ -251,7 +251,7 @@ \subsection{RAVEN Interface}
\begin{lstlisting}[style=XML]
<Models>
<ROM name="ROM1" subType="SciKitLearn">
...
<C>10.0</C>
...
</ROM>
@@ -261,7 +261,7 @@ \subsection{RAVEN Interface}
\begin{lstlisting}[style=XML]
<Models>
<ROM name="ROM1" subType="SciKitLearn">
...
<tol>0.0001</tol>
...
</ROM>
@@ -275,9 +275,9 @@ \subsection{RAVEN Interface}
<variable name="var1">
...
<grid construction="equal" type="value" steps="1">0 1</grid>
...
</variable>

...
</MonteCarlo>
</Samplers>
@@ -1303,20 +1303,20 @@ \subsubsection{Models}
...
</Simulation>
\end{lstlisting}
RAVEN works best with Comma-Separated Value (CSV) files. Therefore, the default
.mat output type needs to be converted to .csv output.
The Dymola interface will automatically convert the .mat output to a human-readable
form, i.e., .csv output, through its implementation of the finalizeCodeOutput function.
\\In order to speed up the reading and conversion of the .mat file, the user can specify
the list of variables (in addition to the Time variable) that need to be imported and
converted into a csv file, minimizing
the IO memory usage as much as possible. Within the \xmlNode{Code} block, the following XML
node (in addition to the \xmlNode{executable} one) can be specified:

\begin{itemize}
  \item \xmlNode{outputVariablesToLoad}, \xmlDesc{space separated list, optional
    parameter}, a space-separated list of variables to be exported from the .mat
    file (in addition to the Time variable). \default{all the variables in the .mat file}.
\end{itemize}
For example:
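A minimal sketch (the code name, executable path, and variable names below are illustrative assumptions, not taken from a working input):
\begin{lstlisting}[style=XML]
<Code name="MyDymolaCode" subType="Dymola">
  <executable>path/to/dymosim</executable>
  <outputVariablesToLoad>x1 x2 T</outputVariablesToLoad>
</Code>
\end{lstlisting}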
69 changes: 60 additions & 9 deletions framework/DataObjects/XDataSet.py
@@ -1711,16 +1711,22 @@ def _usePandasWriteCSV(self,fileName,data,ordered,keepSampleTag=False,keepIndex=
if not localIndex:
data.to_csv(fileName+'.csv',mode=mode,header=header, index=localIndex)
else:
data.to_csv(fileName+'.csv',mode=mode,header=header)
## START garbled index fix ##
## At one point we were seeing "garbled" indexes printed from Pandas: a,b,(RAVEN_sample_ID,),c
## Here, commented out, is a workaround that @alfoa set up to prevent that problem.
## However, it is painfully slow, so if garbled data shows up again, we can
## revisit this fix.
## When using this fix, comment out the data.to_csv line above.
#dataString = data.to_string()
# find headers
#splitted = [",".join(elm.split())+"\n" for elm in data.to_string().split("\n")]
#header, stringData = splitted[0:2], splitted[2:]
#header.reverse()
#toPrint = [",".join(header).replace("\n","")+"\n"]+stringData
#with open(fileName+'.csv', mode='w+') as fileObject:
# fileObject.writelines(toPrint)
## END garbled index fix ##
# if keepIndex, then print as is
elif keepIndex:
data.to_csv(fileName+'.csv',mode=mode,header=header)
@@ -1729,6 +1735,51 @@ def _usePandasWriteCSV(self,fileName,data,ordered,keepSampleTag=False,keepIndex=
data.to_csv(fileName+'.csv',index=False,mode=mode,header=header)
#raw_input('Just wrote to CSV "{}.csv", press enter to continue ...'.format(fileName))

# _useNumpyWriteCSV (below) is a secondary method to write out POINT SET CSVs. When benchmarking it against Pandas, I tested using
# different numbers of variables (M=5,25,100) and different numbers of realizations (R=4,100,1000).
# For each test, I did a unit check just on _usePandasWriteCSV versus _useNumpyWriteCSV, and took the average time
# to run a trial over 1000 trials (in seconds). The results are as follows:
# R M pandas numpy ratio per float p per float n per float ratio
# 4 5 0.001748 0.001004 1.741035857 0.00008740 0.00005020 1.741035857
# 4 25 0.002855 0.001378 2.071843251 0.00002855 0.00001378 2.071843251
# 4 100 0.007006 0.002633 2.660843145 0.00001752 6.5825E-06 2.660843145
# 100 5 0.001982 0.001819 1.089609676 0.00000396 0.00000364 1.089609676
# 100 25 0.003922 0.003898 1.006182658 1.5688E-06 1.5592E-06 1.006182658
# 100 100 0.011124 0.011386 0.976989285 1.1124E-06 1.1386E-06 0.976989285
# 1000 5 0.004108 0.008688 0.472859116 8.2164E-07 1.7376E-06 0.472859116
# 1000 25 0.013367 0.027660 0.483261027 5.3468E-07 1.1064E-06 0.483261027
# 1000 100 0.048791 0.095213 0.512442602 4.8791E-07 9.5213E-07 0.512442602
# The per-float columns divide the time taken by (R*M) to give a fair comparison. The summary of the # var versus # realizations per float is:
# ---------- R ----------------
# M 4 100 1000
# 5 1.741035857 1.089609676 0.472859116
# 25 2.071843251 1.006182658 0.483261027
# 100 2.660843145 0.976989285 0.512442602
# When the value is > 1, numpy is better (so when < 1, pandas is better). It seems that "R" is a better
# indicator of which method is better, and R < 100 is a fairly simple case that is pretty fast anyway,
# so for now we just keep everything using Pandas. - talbpaul and alfoa, January 2018
#
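# As an aside, a minimal, hypothetical re-creation of the timing harness summarized
# above (the original benchmark script is not part of this commit; names and file
# paths are illustrative) could look like:
#
#   import timeit
#   import numpy as np
#   import pandas as pd
#
#   def comparePandasNumpy(R, M, trials=1000):
#     data = np.random.rand(R, M)
#     frame = pd.DataFrame(data, columns=['var%d' % i for i in range(M)])
#     # average seconds per trial for pandas.DataFrame.to_csv
#     tPandas = timeit.timeit(lambda: frame.to_csv('pandas.csv', index=False),
#                             number=trials) / trials
#     # average seconds per trial for numpy.savetxt
#     tNumpy = timeit.timeit(lambda: np.savetxt('numpy.csv', data, delimiter=',',
#                                               header=','.join(frame.columns)),
#                            number=trials) / trials
#     # ratio > 1 means numpy was faster for this (R, M) combination
#     return tPandas, tNumpy, tPandas / tNumpy
#
#   for R in (4, 100, 1000):
#     for M in (5, 25, 100):
#       print(R, M, comparePandasNumpy(R, M))
#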
#def _useNumpyWriteCSV(self,fileName,data,ordered,keepSampleTag=False,keepIndex=False,mode='w'):
# # TODO docstrings
# # TODO assert point set -> does not work right for ND (use Pandas)
# # TODO the "mode" should be changed for python 3: mode has to be 'ba' if appending, not 'a' when using numpy.savetxt
# with open(fileName+'.csv',mode) as outFile:
# if mode == 'w':
# #write header
# header = ','.join(ordered)
# else:
# header = ''
# #print('DEBUGG data:',data[ordered])
# data = data[ordered].to_array()
# if not keepSampleTag:
# data = data.drop(self.sampleTag)
# data = data.values.transpose()
# # set up formatting for types
# # TODO potentially slow loop
# types = list('%.18e' if self._getCompatibleType(data[0][i]) == float else '%s' for i in range(len(ordered)))
# np.savetxt(outFile,data,header=header,fmt=types)
# # format data?


### HIERARCHICAL STUFF ###
def _constructHierPaths(self):
