3242 slow csv import with thesaurus creation #3927

Merged 26 commits on Sep 29, 2021
Commits
7e0e44b first implementation (LaszloKecskes, Sep 20, 2021)
51e7136 added extra test (LaszloKecskes, Sep 20, 2021)
0de48ce updated select typeparser (LaszloKecskes, Sep 20, 2021)
a99bf66 updated multiselect typeparser and failing tests (LaszloKecskes, Sep 21, 2021)
baa367e code cleanup (LaszloKecskes, Sep 21, 2021)
9f5a68e added error handling (LaszloKecskes, Sep 21, 2021)
4d01edd fix code climate issues (LaszloKecskes, Sep 21, 2021)
b43cfb3 further code cleanup (LaszloKecskes, Sep 21, 2021)
33ecda7 Merge branch 'development' into 3242_slow_csv_import_with_thesaurus_c… (LaszloKecskes, Sep 21, 2021)
8cf7de1 removing unneccessary async (LaszloKecskes, Sep 21, 2021)
1ab2637 removed Readable option from importFile (LaszloKecskes, Sep 22, 2021)
a6cef0f corrected failing unit tests (LaszloKecskes, Sep 22, 2021)
38d36f0 Merge branch '3242_slow_csv_import_with_thesaurus_creation' of https:… (LaszloKecskes, Sep 22, 2021)
8ae1ef1 added smoke test (LaszloKecskes, Sep 22, 2021)
af075d5 handling languages (LaszloKecskes, Sep 22, 2021)
ee3eb2c updated test with languages (LaszloKecskes, Sep 22, 2021)
a4bc6c3 corrected missing translation problem (LaszloKecskes, Sep 23, 2021)
c5f6842 removing eslint line (LaszloKecskes, Sep 24, 2021)
6e0e903 changing thesauri database query to single (LaszloKecskes, Sep 24, 2021)
23c12cd changed error handling on arrangeThesauri (LaszloKecskes, Sep 24, 2021)
c5f5b1f refactored arrangeThesauri (LaszloKecskes, Sep 24, 2021)
ef998da updated tests (LaszloKecskes, Sep 24, 2021)
b42630d changing a test description (LaszloKecskes, Sep 24, 2021)
4cae862 removed select-multiselect differentiation (LaszloKecskes, Sep 28, 2021)
ab7986a removed error catch, added typing, refactored functions (LaszloKecskes, Sep 28, 2021)
34898b2 Merge branch 'development' into 3242_slow_csv_import_with_thesaurus_c… (daneryl, Sep 29, 2021)
5 changes: 3 additions & 2 deletions app/api/csv/csvLoader.ts
@@ -1,3 +1,4 @@
+/* eslint-disable max-statements */
import { EventEmitter } from 'events';

import templates from 'api/templates';
@@ -12,7 +13,7 @@ import { ensure } from 'shared/tsUtils';
import { ObjectId } from 'mongodb';
import csv, { CSVRow } from './csv';
import importFile from './importFile';
-import { importEntity, translateEntity } from './importEntity';
+import { arrangeThesauri, importEntity, translateEntity } from './importEntity';
import { extractEntity, toSafeName } from './entityRow';

export class CSVLoader extends EventEmitter {
@@ -55,6 +56,7 @@ export class CSVLoader extends EventEmitter {
(await settings.get()).languages
).map((l: LanguageSchema) => l.key);
const { newNameGeneration = false } = await settings.get();
+await arrangeThesauri(file, template, availableLanguages, this);

await csv(await file.readStream(), this.stopOnError)
.onRow(async (row: CSVRow) => {
@@ -64,7 +66,6 @@ export class CSVLoader extends EventEmitter {
options.language,
newNameGeneration
);
-
if (rawEntity) {
const entity = await importEntity(rawEntity, template, file, options);
await translateEntity(entity, rawTranslations, template, file);
127 changes: 123 additions & 4 deletions app/api/csv/importEntity.ts
@@ -1,18 +1,23 @@
/* eslint-disable max-statements */
import entities from 'api/entities';
import { search } from 'api/search';
import entitiesModel from 'api/entities/entitiesModel';
import { processDocument } from 'api/files/processDocument';
-import { RawEntity } from 'api/csv/entityRow';
+import { RawEntity, toSafeName } from 'api/csv/entityRow';
import { TemplateSchema } from 'shared/types/templateType';
import { MetadataSchema, PropertySchema } from 'shared/types/commonTypes';
import { propertyTypes } from 'shared/propertyTypes';
import { ImportFile } from 'api/csv/importFile';
import thesauri from 'api/thesauri';
import { EntitySchema } from 'shared/types/entityType';
import { ensure } from 'shared/tsUtils';

import { attachmentsPath, files } from 'api/files';
import { propertyTypes } from 'shared/propertyTypes';
import { generateID } from 'shared/IDGenerator';

import { normalizeThesaurusLabel } from './typeParsers/select';
import { splitMultiselectLabels } from './typeParsers/multiselect';
import typeParsers from './typeParsers';
import csv, { CSVRow } from './csv';

const parse = async (toImportEntity: RawEntity, prop: PropertySchema) =>
typeParsers[prop.type]
@@ -66,6 +71,120 @@ type Options = {
language: string;
};

+const filterJSObject = (input: { [k: string]: any }, keys: string[]): { [k: string]: any } => {
+  const result: { [k: string]: any } = {};
+  keys.forEach(k => {
+    if (input.hasOwnProperty(k)) {
+      result[k] = input[k];
+    }
+  });
+  return result;
+};
+
+const arrangeThesauri = async (
+  file: ImportFile,
+  template: TemplateSchema,
+  languages?: string[],
+  errorContext?: any
+) => {
+  let nameToThesauriIdSelects: { [k: string]: string } = {};
+  let nameToThesauriIdMultiselects: { [k: string]: string } = {};
+  const thesauriIdToExistingValues = new Map();
+  const thesauriIdToNewValues: Map<string, Set<string>> = new Map();
+  const thesauriIdToNormalizedNewValues = new Map();
+  const thesauriRelatedProperties = template.properties?.filter(p =>
+    ['select', 'multiselect'].includes(p.type)
+  );
+  thesauriRelatedProperties?.forEach(p => {
+    if (p.content && p.type) {
+      const thesarusID = p.content.toString();
+      if (p.type === propertyTypes.select) {
+        nameToThesauriIdSelects[p.name] = thesarusID;
+        languages?.forEach(suffix => {
+          nameToThesauriIdSelects[`${p.name}__${suffix}`] = thesarusID;
+        });
+      } else if (p.type === propertyTypes.multiselect) {
+        nameToThesauriIdMultiselects[p.name] = thesarusID;
+        languages?.forEach(suffix => {
+          nameToThesauriIdMultiselects[`${p.name}__${suffix}`] = thesarusID;
+        });
+      }
+    }
+  });
+  const allRelatedThesauri = await Promise.all(
+    Array.from(
+      new Set(thesauriRelatedProperties?.map(p => p.content?.toString()).filter(t => t))
+    ).map(async id => thesauri.getById(id))
+  );
+  allRelatedThesauri.forEach(t => {
+    if (t) {
+      const id = t._id.toString();
+      thesauriIdToExistingValues.set(
+        id,
+        new Set(t.values?.map(v => normalizeThesaurusLabel(v.label)))
+      );
+      thesauriIdToNewValues.set(id, new Set());
+      thesauriIdToNormalizedNewValues.set(id, new Set());
+    }
+  });
+  function handleLabels(id: string, original: string, normalized: string | null) {
+    if (
+      normalized &&
+      !thesauriIdToExistingValues.get(id).has(normalized) &&
+      !thesauriIdToNormalizedNewValues.get(id).has(normalized)
+    ) {
+      thesauriIdToNewValues.get(id)?.add(original);
+      thesauriIdToNormalizedNewValues.get(id).add(normalized);
+    }
+  }
+  await csv(await file.readStream(), errorContext?.stopOnError)
+    .onRow(async (row: CSVRow, index: number) => {
+      if (index === 0) {
+        const columnnames = Object.keys(row);
+        nameToThesauriIdSelects = filterJSObject(nameToThesauriIdSelects, columnnames);
+        nameToThesauriIdMultiselects = filterJSObject(nameToThesauriIdMultiselects, columnnames);
+      }
+      Object.entries(nameToThesauriIdSelects).forEach(([name, id]) => {
+        const label = row[name];
+        if (label) {
+          const normalizedLabel = normalizeThesaurusLabel(label);
+          handleLabels(id, label, normalizedLabel);
+        }
+      });
+      Object.entries(nameToThesauriIdMultiselects).forEach(([name, id]) => {
+        const labels = splitMultiselectLabels(row[name]);
+        if (labels) {
+          Object.entries(labels).forEach(([normalizedLabel, originalLabel]) => {
+            handleLabels(id, originalLabel, normalizedLabel);
+          });
+        }
+      });
+    })
+    .onError(async (e: Error, row: CSVRow, index: number) => {
+      if (errorContext) {
+        errorContext._errors[index] = e;
+        errorContext.emit('loadError', e, toSafeName(row), index);
+      }
+    })
+    .read();
+  for (let i = 0; i < allRelatedThesauri.length; i += 1) {
+    const thesaurus = allRelatedThesauri[i];
+    if (thesaurus !== null) {
+      const newValues: { label: string }[] = Array.from(
+        thesauriIdToNewValues.get(thesaurus._id.toString()) || []
+      ).map(tval => ({ label: tval }));
+      if (newValues.length > 0) {
+        const thesaurusValues = thesaurus.values || [];
+        // eslint-disable-next-line no-await-in-loop
+        await thesauri.save({
+          ...thesaurus,
+          values: thesaurusValues.concat(newValues),
+        });
+      }
+    }
+  }
+};

const importEntity = async (
toImportEntity: RawEntity,
template: TemplateSchema,
@@ -122,4 +241,4 @@ const translateEntity = async (
await search.indexEntities({ sharedId: entity.sharedId }, '+fullText');
};

-export { importEntity, translateEntity };
+export { arrangeThesauri, importEntity, translateEntity };
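
The core of `arrangeThesauri` is a deduplication pass: each CSV label is normalized, checked against the values already in the thesaurus and against labels seen earlier in the same file, and only the first original spelling is queued for saving. A minimal standalone sketch of that step follows. `normalizeLabel` and `collectNewLabels` are hypothetical names, and the trim-plus-lowercase normalization is an assumption, since the real `normalizeThesaurusLabel` is not shown in this diff.

```typescript
// Hypothetical stand-in for normalizeThesaurusLabel (not shown in this diff).
// Assumption: normalization means trim + lowercase, with blanks mapped to null.
const normalizeLabel = (label: string): string | null => {
  const trimmed = label.trim().toLowerCase();
  return trimmed.length > 0 ? trimmed : null;
};

// Collect labels that are not yet in the thesaurus, keeping the first
// original spelling seen for each normalized form.
const collectNewLabels = (existing: string[], incoming: string[]): string[] => {
  const existingNormalized = new Set(
    existing.map(normalizeLabel).filter((l): l is string => l !== null)
  );
  const seenNormalized = new Set<string>();
  const newLabels: string[] = [];
  incoming.forEach(original => {
    const normalized = normalizeLabel(original);
    if (normalized && !existingNormalized.has(normalized) && !seenNormalized.has(normalized)) {
      seenNormalized.add(normalized);
      newLabels.push(original);
    }
  });
  return newLabels;
};

// Sample labels in the style of arrangeThesauriTest.csv; "A" is assumed to
// already exist in the thesaurus.
const newSelectLabels = collectNewLabels(['A'], ['B', 'C', 'b', 'B', 'd', 'D', ' b', ' ', 'A']);
console.log(newSelectLabels);
```

Because the new values are gathered up front, each thesaurus needs at most one `save` call per import instead of one per row that introduces a label.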
10 changes: 3 additions & 7 deletions app/api/csv/importFile.ts
@@ -1,6 +1,5 @@
import fs from 'fs';
import path from 'path';
-import { Readable } from 'stream';

import { generateFileName, fileFromReadStream, uploadsPath } from 'api/files/filesystem';
import { createError } from 'api/utils';
@@ -17,16 +16,13 @@ const extractFromZip = async (zipPath: string, fileName: string) => {
 };

 export class ImportFile {
-  filePath: string | Readable;
+  filePath: string;

-  constructor(filePath: string | Readable) {
+  constructor(filePath: string) {
     this.filePath = filePath;
   }

   async readStream(fileName = 'import.csv') {
-    if (this.filePath instanceof Readable) {
-      return this.filePath;
-    }
     if (path.extname(this.filePath) === '.zip') {
       return extractFromZip(this.filePath, fileName);
     }
@@ -46,6 +42,6 @@ export class ImportFile {
}
}

-const importFile = (filePath: string | Readable) => new ImportFile(filePath);
+const importFile = (filePath: string) => new ImportFile(filePath);

export default importFile;
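
With this PR the CSV is read twice per import: `arrangeThesauri` calls `file.readStream()` for its pre-scan, and `CSVLoader` calls it again for the import itself. The removed `Readable` branch returned the same stream instance on every call, so a second pass would have received a stream that had already been consumed. A minimal sketch of the difference, using hypothetical simplified classes (`StreamBackedFile`, `PathBackedFile`) rather than the real `ImportFile`:

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';
import { Readable } from 'stream';

// Hypothetical, simplified stand-ins for ImportFile; for illustration only.
class StreamBackedFile {
  private stream: Readable;

  constructor(stream: Readable) {
    this.stream = stream;
  }

  // Mirrors the removed branch: every call hands back the same instance.
  readStream(): Readable {
    return this.stream;
  }
}

class PathBackedFile {
  private filePath: string;

  constructor(filePath: string) {
    this.filePath = filePath;
  }

  // Every call opens a new stream positioned at the start of the file.
  readStream(): Readable {
    return fs.createReadStream(this.filePath);
  }
}

// Temporary CSV used only by this sketch.
const csvPath = path.join(os.tmpdir(), `import-sketch-${process.pid}.csv`);
fs.writeFileSync(csvPath, 'title,select_property\nselect_1,B\n');

const streamBacked = new StreamBackedFile(Readable.from(['title\n']));
const sameInstance = streamBacked.readStream() === streamBacked.readStream();

const pathBacked = new PathBackedFile(csvPath);
const firstPass = pathBacked.readStream();
const secondPass = pathBacked.readStream();
const freshInstances = firstPass !== secondPass;

// Release the file handles; this sketch only compares stream identity.
firstPass.destroy();
secondPass.destroy();

console.log({ sameInstance, freshInstances });
```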
8 changes: 4 additions & 4 deletions app/api/csv/specs/__snapshots__/importFile.spec.ts.snap
@@ -1,11 +1,11 @@
// Jest Snapshot v1, https://goo.gl/fbAQLP

exports[`importFile readStream should return a readable stream for the csv file 1`] = `
-"Title , text label , numeric label, non configured, select label, not defined type, geolocation_geolocation,auto id, additional tag(s)
+"Title , text label , numeric label, non configured, select_label, not defined type, geolocation_geolocation,auto id, additional tag(s), multi_select_label

-title1, text value 1, 1977, ______________, thesauri1 , notType1 , 1|1,,tag1
-title2, text value 2, 2019, ______________, thesauri2 , notType2 , ,,tag2
-title3, text value 3, 2020, ______________, thesauri2 , notType3 , 0|0,,tag3
+title1, text value 1, 1977, ______________, thesauri1 , notType1 , 1|1,,tag1, multivalue1
+title2, text value 2, 2019, ______________, thesauri2 , notType2 , ,,tag2, multivalue2
+title3, text value 3, 2020, ______________, thesauri2 , notType3 , 0|0,,tag3, multivalue1|multivalue3
"
`;

17 changes: 17 additions & 0 deletions app/api/csv/specs/arrangeThesauriTest.csv
@@ -0,0 +1,17 @@
title,unrelated_property,select_property__en, select_property__es,multiselect_property__en, multiselect_property__es
select_1,unrelated_text,B,Bes,A,Aes
select_2,unrelated_text,C,Ces,A,Aes
select_3,unrelated_text,b,bes,A,Aes
select_4,unrelated_text,B,Bes,A,Aes
select_5,unrelated_text,d,des,A,Aes
select_6,unrelated_text,D,Des,A,Aes
select_7,unrelated_text, b,bes,A,Aes
select_8,unrelated_text, , ,A,Aes
select_8,unrelated_text, , ,A,Aes
multiselect_1,unrelated_text,A,Aes,B,Bes
multiselect_2,unrelated_text,A,Aes,c,ces
multiselect_3,unrelated_text,A,Aes,A|b,Aes|bes
multiselect_4,unrelated_text,A,Aes,a|B|C,aes|Bes|Ces
multiselect_5,unrelated_text,A,Aes, a| b | , aes| bes |
multiselect_6,unrelated_text,A,Aes, | | , | |
multiselect_7, unrelated_text,A,Aes,A|B|C|D| |E| e| g ,Aes|Bes|Ces|Des| |Ees| ees| ges
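
The multiselect cells in this fixture mix casing, padding and empty segments around the `|` separator. In `arrangeThesauri`, `splitMultiselectLabels` is consumed as a map from normalized label to original label. The sketch below shows one way such a split could behave on the last fixture row; `splitLabels` is a hypothetical stand-in, and its trim-plus-lowercase normalization is an assumption, since the real helper in `typeParsers/multiselect` is not part of the visible diff.

```typescript
// Hypothetical stand-in for splitMultiselectLabels (not shown in this diff).
// Assumptions: '|' is the separator and normalization is trim + lowercase.
const splitLabels = (cell: string): Record<string, string> => {
  const result: Record<string, string> = {};
  cell.split('|').forEach(part => {
    const original = part.trim();
    const normalized = original.toLowerCase();
    // Skip blank segments and keep the first spelling of each label.
    if (normalized && !Object.prototype.hasOwnProperty.call(result, normalized)) {
      result[normalized] = original;
    }
  });
  return result;
};

// multiselect_property__en cell of the multiselect_7 row.
const splitResult = splitLabels('A|B|C|D| |E| e| g ');
console.log(splitResult);
```

Under these assumptions the row yields six distinct labels (`A`, `B`, `C`, `D`, `E`, `g`); the blank segment and the repeated ` e` are dropped.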