Changes in the build scripts and steps:
- changed the output file naming;
- made some utility/config methods non-public, etc;
- changes in merge/postmerge: will not generate/convert 'Detailed' output anymore;
- added export.sh instead of generating it during 'postmerge';
- replaced cpath2.sh with run.sh and build.sh, made all .sh executable;
- removed unused .dockerignore;
- updated readme;
IgorRodchenkov committed May 17, 2024
1 parent 7289d4f commit 64587fd
Showing 19 changed files with 215 additions and 372 deletions.
6 changes: 0 additions & 6 deletions .dockerignore

This file was deleted.

45 changes: 23 additions & 22 deletions README.md
@@ -66,9 +66,10 @@ Alternatively, can run/debug the demo/dev app as:

### Working directory

-Directory 'work' is where the configuration, data files and indices are saved.
+Directory e.g. './work' is defined by the CPATH2_HOME environment property;
+this is where the configuration, data, and indices are or will be saved.

-    cd work
+    cd ./work

The directory may contain:
- application.properties (to configure various server and model options);
@@ -80,35 +81,35 @@ The directory may contain:

To see available commands and options, run:

-    bash cpath2.sh
+    ./run.sh

-In order to create a new cpath2 instance, define or update the metadata.json,
+To prepare/create a new app/data instance, edit metadata.json,
prepare the input data archives (see below how), install `jq` and `gunzip`,
and run:

-    bash cpath2.sh --build 2>&1 >build.log &
+    ./build.sh >build.log 2>&1 &

-, which normally takes a day or two - executes the following data integration steps:
-import the metadata, clean, convert to BioPAX, normalize the data, build the BioPAX warehouse,
-merge into the main BioPAX model, create Lucene index, several summary files and `export.sh` script to convert
-the final BioPAX models to SIF, GMT, TXT formats.
+This takes about half a day (and uses about 60Gb of RAM),
+executing the integration steps (PREMERGE, MERGE, POSTMERGE):
+- import the metadata
+- transform (clean, convert, normalize) the input data
+- build the intermediate BioPAX Warehouse model (from ChEBI, UniProt and custom id-mapping files)
+- merge all the preprocessed input BioPAX models into the main BioPAX model (pc-biopax)
+- create a full-text index of the pc-biopax model (the index also includes chebi/uniprot id-mapping for internal use)
+- create uniprot.txt, datasources.txt, blacklist.txt (used for converting BioPAX to SIF)
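Before kicking off a multi-hour build, it can be worth a pre-flight check that the extra tools mentioned above (`jq`, `gunzip`) are actually on the PATH. A minimal sketch — this helper is not part of build.sh, and the tool list is only what the README names:

```shell
# Report whether each external tool the build shells out to is available.
status="ok"
for tool in jq gunzip; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
    status="incomplete"
  fi
done
echo "pre-flight: $status"
```

If anything prints `missing`, install it first; build.sh itself is not assumed to perform this check.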

cd downloads

-Copy the latest paxtools.jar into this directory and run -
+Run export.sh to convert the main pc-biopax model to SIF, GMT and TXT formats, and to
+generate summary files about the pathways and physical entities.

-    sh export.sh 2>&1 >export.log &
+    ./export.sh >export.log 2>&1 &

-(- which takes overnight or a day and night); upload/copy/move (but keep at least blacklist.txt, *All.BIOPAX.owl.gz)
-all the files from this here and ../data/ directories to the file server, or configure so that they can be downloaded
-from e.g. `http://www.pathwaycommons.org/archives/PC2/v{version_number}` (or else).
+(this takes a couple of days).
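One way to keep an eye on the long-running export is to count completed conversions in the log. The `Converted ...` line format below is an assumption (it mirrors the echo lines the previously generated export.sh used), so adjust the pattern to whatever export.sh actually prints:

```shell
# Stand-in log for illustration; a real run appends one such line per finished format.
printf 'Converted pc-biopax to GSEA.\nConverted pc-biopax to SIF.\n' > export.log
done_steps=$(grep -c '^Converted' export.log)
echo "conversions finished so far: $done_steps"
```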

-Once the instance is configured and data processed, run the web service using the same
-script as follows:
+Once the instance is configured and the data processed, run the web service, e.g. as follows:

-    nohup bash cpath2.sh --server 2>&1 &
+    ./run.sh --server 2>&1 &

-(watch the `nohup.out` and `cpath2.log`)
+(watch `cpath2.log`)

### Metadata

@@ -141,8 +142,8 @@ only ZIP archive (can contain multiple files) is currently supported by cPath2 d

Note: BioPAX L1 and L2 models will be auto-converted to BioPAX Level 3.

-Prepare original BioPAX and PSI-MI/PSI-MITAB data archives in the 'data' folder as follows:
-- download (wget) original files or archives from the pathway resource (e.g., `wget http://www.reactome.org/download/current/biopax3.zip`)
+Prepare original BioPAX and PSI-MI/PSI-MITAB data archives in the './data' dir as follows:
+- download an original file/archive from the pathway resource (e.g., `wget http://www.reactome.org/download/current/biopax3.zip`)
- extract what you need (e.g. some species data only)
- create a new zip archive using name like `<IDENTIFIER>.zip` (datasource identifier, e.g., `reactome_human.zip`).
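Since the build matches each archive in ./data to its metadata.json entry by the `<IDENTIFIER>` part of the file name, a quick naming check before the run can help. A hypothetical helper — the lowercase `a-z0-9_` identifier convention is an assumption inferred from examples like `reactome_human.zip`:

```shell
# Flag archive names in ./data that don't look like <identifier>.zip.
mkdir -p data
: > data/reactome_human.zip   # example archive so the loop below has input
for f in data/*.zip; do
  base=$(basename "$f" .zip)
  case "$base" in
    *[!a-z0-9_]*) echo "$f: name does not look like <identifier>.zip" ;;
    *) echo "$f: ok" ;;
  esac
done
```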

7 changes: 2 additions & 5 deletions src/main/java/cpath/analysis/EntityFeaturesSummary.java
@@ -19,13 +19,10 @@
* Prints out a summary of the main merged BioPAX model
* (just for development and debugging).
*
- * This class is not an essential part of this system;
- * it's to be optionally called using the cpath2.sh script,
- * i.e., via {@link ConsoleApplication} '-run-analysis' command.
+ * This can be used with {@link ConsoleApplication} '--analyze' ('-a') command.
*
* This uses a Java option: cpath.analysis.filter.metadataids=id1,id2,..
- * (if not defined, all data are analyzed; IDs are cpath2 metadata/datasource identifiers,
- * i.e., not providers' names)
+ * if not defined, all data are analyzed; id1,id2,.. are datasource identifiers (see: metadata.json).
*
* @author rodche
*/
177 changes: 42 additions & 135 deletions src/main/java/cpath/service/ConsoleApplication.java
@@ -6,7 +6,6 @@
import cpath.service.jaxb.SearchHit;
import cpath.service.jaxb.SearchResponse;
import cpath.service.metadata.Datasource;
-import cpath.service.metadata.Datasource.METADATA_TYPE;

import org.apache.commons.cli.*;
import org.apache.commons.io.IOUtils;
@@ -44,10 +43,6 @@
@Profile({"admin"})
public class ConsoleApplication implements CommandLineRunner {
private static final Logger LOG = LoggerFactory.getLogger(ConsoleApplication.class);
-  private static final String JAVA_PAXTOOLS = "$JAVA_HOME/bin/java -Xmx64g " +
-    "-Dpaxtools.CollectionProvider=org.biopax.paxtools.trove.TProvider " +
-    "-Dpaxtools.normalizer.use-latest-registry=true " +
-    "-Dpaxtools.core.use-latest-genenames=true -jar paxtools.jar";

@Autowired
private Service service;
@@ -193,26 +188,34 @@ private void premerge() {
premerger.premerge();
// create the Warehouse BioPAX model and id-mapping db table
if (!Files.exists(Paths.get(service.settings().warehouseModelFile()))) {
-      premerger.buildWarehouse();
+      premerger.buildWarehouse(); //also makes/saves id-mapping docs in the lucene index via service.mapping() DAO
}
}

private void merge() {
-    Merger biopaxMerger = new Merger(service);
-    biopaxMerger.merge();
-
-    LOG.info("Indexing BioPAX Model (this may take an hour or so)...");
-    service.index().save(service.getModel());
-
-    LOG.info("Generating blacklist.txt...");
-    //Generates, if not exist, the blacklist.txt -
-    //to exclude/keep ubiquitous small molecules (e.g. ATP)
-    //from graph query and output format converter results.
-    BlacklistGenerator3 gen = new BlacklistGenerator3();
-    Blacklist blacklist = gen.generateBlacklist(service.getModel());
-    // Write all the blacklisted ids to the output
-    if (blacklist != null) {
-      blacklist.write(service.settings().blacklistFile());
+    if (!Files.exists(Paths.get(service.settings().mainModelFile()))) {
+      Merger biopaxMerger = new Merger(service);
+      //each normalized datasource model is further improved with Warehouse and id-mapping and merged into the main...
+      biopaxMerger.merge(); //also saves it
+    } else {
+      LOG.info("Found {}...", service.settings().mainModelFile());
+    }
+
+    service.init(); // reload the model, index, blacklist if exists...
+
+    if(service.getBlacklist() == null) {
+      LOG.info("Generating the list of ubiquitous small molecules, {}...", service.settings().blacklistFile());
+      //Generate the blacklist.txt to exclude/keep ubiquitous small molecules (e.g. ATP)
+      //from graph query and output format converter results.
+      BlacklistGenerator3 gen = new BlacklistGenerator3();
+      Blacklist blacklist = gen.generateBlacklist(service.getModel());
+      // Write all the blacklisted ids to the output
+      if (blacklist != null) {
+        service.setBlacklist(blacklist);
+        blacklist.write(service.settings().blacklistFile());
+      }
+    } else { //means - service.init() loaded it earlier
+      LOG.info("Found: {} - ok", service.settings().blacklistFile());
+    }
}

@@ -266,7 +269,7 @@ private void exportData(final String output, String[] uris, String[] datasources
SimpleIOHandler sio = new SimpleIOHandler(BioPAXLevel.L3);
sio.absoluteUris(true); // write full URIs
sio.convertToOWL(model, os, uris);
-    IOUtils.closeQuietly(os);//though, convertToOWL must have done this already
+    IOUtils.closeQuietly(os, null);//though, convertToOWL must have done this already
}

private Class<? extends BioPAXElement> biopaxTypeFromSimpleName(String type) {
@@ -283,10 +286,9 @@ private Class<? extends BioPAXElement> biopaxTypeFromSimpleName(String type) {

private void postmerge() throws IOException {
LOG.info("postmerge: started");
// Updates counts of pathways, etc. and saves in the Metadata table.
// This depends on the full-text index created already

// Update the counts of pathways, interactions, participants per data source and save.
LOG.info("updating pathway/interaction/participant counts per data source...");
// update counts for each non-warehouse metadata entry
for (Datasource ds : service.metadata().getDatasources()) {
ds.getFiles().clear(); //do not export to json
if(ds.getType().isNotPathwayData()) {
Expand All @@ -299,12 +301,7 @@ private void postmerge() throws IOException {
}
CPathUtils.saveMetadata(service.metadata(), service.settings().getMetadataLocation()); //update the json file

-    //init the service - load main model and index
-    service.init();
-    final Model mainModel = service.getModel();
-    LOG.info("loaded main model:{} biopax elements", mainModel.getObjects().size());
-
-    // create an imported data summary file.txt (issue#23)
+    // Generate datasources.txt summary file (issue#23)
PrintWriter writer = new PrintWriter(new OutputStreamWriter(Files.newOutputStream(
Paths.get(service.settings().downloadsDir(), "datasources.txt")), StandardCharsets.UTF_8)
);
@@ -322,20 +319,23 @@
}
writer.flush();
writer.close();
-    LOG.info("done datasources.txt");
+    LOG.info("generated datasources.txt");

+    service.init(); // load/reload the main model, index, etc.

//this was to integrate with UniProt portal/data - to add/update their external links to PathwayCommons apps...
LOG.info("creating the list of primary uniprot ACs...");
Set<String> acs = new TreeSet<>();
-    //exclude publication xrefs
-    Set<Xref> xrefs = new HashSet<>(mainModel.getObjects(UnificationXref.class));
-    xrefs.addAll(mainModel.getObjects(RelationshipXref.class));
-    for (Xref x : xrefs) {
-      String id = x.getId();
-      if (CPathUtils.startsWithAnyIgnoreCase(x.getDb(), "uniprot")
-          && id != null && !acs.contains(id)) {
-        acs.addAll(service.map(List.of(id), "UNIPROT"));
-      }
-    }
+    service.getModel().getObjects(Xref.class)
+      .stream()
+      .filter(x -> !(x instanceof PublicationXref)) //except for publication xrefs
+      .forEach(x -> {
+        String id = x.getId();
+        if (CPathUtils.startsWithAnyIgnoreCase(x.getDb(), "uniprot")
+            && id != null && !acs.contains(id)) {
+          acs.addAll(service.map(List.of(id), "UNIPROT"));
+        }
+      });
writer = new PrintWriter(new OutputStreamWriter(Files.newOutputStream(
Paths.get(service.settings().downloadsDir(), "uniprot.txt")), StandardCharsets.UTF_8)
);
@@ -347,62 +347,9 @@ private void postmerge() throws IOException {
writer.close();
LOG.info("generated uniprot.txt");

-    LOG.info("init the full-text search engine...");
-    final Index index = new IndexImpl(mainModel, service.settings().indexDir(), false);
-    // generate the "Detailed" pathway data file:
-    createDetailedBiopax(mainModel, index);

-    // Generate export.sh script (to convert the data/model to other formats; we then run this script separately)
-    LOG.info("generating script: {}...", service.settings().exportScriptFile());
-    final String commonPrefix = service.settings().exportArchivePrefix();
-    writer = new PrintWriter(new OutputStreamWriter(Files.newOutputStream(
-      Paths.get(service.settings().exportScriptFile())), StandardCharsets.UTF_8));
-    //begin writing shell commands
-    writer.println("""
-      #!/bin/sh
-      # A script for converting the BioPAX data in this dir to other formats.
-      # There must be blacklist.txt and paxtools.jar files already.
-      """);
-    //writeScriptCommands(service.settings().biopaxFileName("Detailed"), writer, true); //skip; users can download files and convert later if they want.
-    writeScriptCommands(service.settings().biopaxFileName("All"), writer, true);
-    //rename SIF files that were cut from corresponding extended SIF (.txt) ones
-    writer.println("rename 's/txt\\.sif/sif/' *.txt.sif");
-    writer.println(String.format("gzip %s.*.txt %s.*.sif %s.*.gmt %s.*.xml",
-      commonPrefix, commonPrefix, commonPrefix, commonPrefix));
-    //generate pathways.txt (parent-child) and physical_entities.json (URI-to-IDs mapping) files
-    writer.println(String.format("%s %s '%s' '%s' %s 2>&1", JAVA_PAXTOOLS, "summarize",
-      service.settings().biopaxFileName("All"), "pathways.txt", "--pathways"));
-    //generate the list of physical entities (some uri, names, ids) as json array:
-    writer.println(String.format("%s %s '%s' '%s' %s 2>&1", JAVA_PAXTOOLS, "summarize",
-      service.settings().biopaxFileName("All"), "physical_entities.json", "--uri-ids"));
-    //filter and convert just created above file to the json map of only "generic" PEs:
-    writer.println("""
-      cat physical_entities.json | jq -cS 'map(select(.generic)) | reduce .[] as $o ({}; . + {($o.uri): {name: $o.name, label:$o.label, synonyms:$o."hgnc.symbol"}})' > generic-physical-entity-map.json
-      gzip pathways.txt physical_entities.json
-      echo "Export completed!"
-      """);
-    writer.close();

LOG.info("postmerge: done.");
}

-  private void writeScriptCommands(String bpFilename, PrintWriter writer, boolean exportToGSEA) {
-    //make output file name prefix that includes datasource and ends with '.':
-    final String prefix = bpFilename.substring(0, bpFilename.indexOf("BIOPAX."));
-    final String commaSepTaxonomyIds = String.join(",", service.settings().getOrganismTaxonomyIds());
-    if (exportToGSEA) {
-      writer.println(String.format("%s %s '%s' '%s' %s 2>&1", JAVA_PAXTOOLS, "toGSEA", bpFilename,
-        prefix + "hgnc.gmt", "'hgnc.symbol' 'organisms=" + commaSepTaxonomyIds + "'"));//'hgnc.symbol' - important
-      // writer.println(String.format("%s %s '%s' '%s' %s 2>&1", JAVA_PAXTOOLS, "toGSEA", bpFilename,
-      //   prefix + "uniprot.gmt", "'uniprot' 'organisms=" + commaSepTaxonomyIds + "'"));
-      writer.println("echo \"Converted " + bpFilename + " to GSEA.\"");
-    }
-    writer.println(String.format("%s %s '%s' '%s' %s 2>&1", JAVA_PAXTOOLS, "toSIF", bpFilename,
-      prefix + "hgnc.txt", "seqDb=hgnc -extended -andSif exclude=neighbor_of"));
-    //UniProt based xSIF files can be huge and take too long (2 days) to generate; skip for now.
-    writer.println("echo \"Converted " + bpFilename + " to SIF.\"");
-  }

private Collection<String> findAllUris(Index index, Class<? extends BioPAXElement> type, String[] ds, String[] org) {
Collection<String> uris = new ArrayList<>();
SearchResponse resp = index.search("*", 0, type, ds, org);
@@ -418,44 +365,4 @@ private Collection<String> findAllUris(Index index, Class<? extends BioPAXElemen
return uris;
}

-  private void createDetailedBiopax(final Model mainModel, Index index) {
-    //collect BioPAX pathway data source names
-    final Set<String> pathwayDataSources = new HashSet<>();
-    for (Datasource md : service.metadata().getDatasources()) {
-      if (md.getType() == METADATA_TYPE.BIOPAX) {
-        pathwayDataSources.add(md.standardName());
-      }
-    }
-    final String archiveName = service.settings().biopaxFileNameFull("Detailed");
-    exportBiopax(mainModel, index, archiveName, pathwayDataSources.toArray(new String[]{}), null);
-  }
-
-  private void exportBiopax(Model mainModel, Index index, String biopaxArchive,
-                            String[] datasources, String[] organisms) {
-    // check file exists
-    if (!(new File(biopaxArchive)).exists()) {
-      LOG.info("creating new " + biopaxArchive);
-      try {
-        //find all entities (all child elements will be then exported too)
-        Collection<String> uris = new HashSet<>();
-        uris.addAll(findAllUris(index, Pathway.class, datasources, organisms));
-        uris.addAll(findAllUris(index, Interaction.class, datasources, organisms));
-        uris.addAll(findAllUris(index, Complex.class, datasources, organisms));
-        // export objects found above to a new biopax archive
-        if (!uris.isEmpty()) {
-          OutputStream os = new GZIPOutputStream(new FileOutputStream(biopaxArchive));
-          SimpleIOHandler sio = new SimpleIOHandler(BioPAXLevel.L3);
-          sio.convertToOWL(mainModel, os, uris.toArray(new String[]{}));
-          LOG.info("successfully created " + biopaxArchive);
-        } else {
-          LOG.info("no pathways/interactions found; skipping " + biopaxArchive);
-        }
-      } catch (IOException e) {
-        throw new RuntimeException(e);
-      }
-    } else {
-      LOG.info("skipped due to file already exists: " + biopaxArchive);
-    }
-  }

}