3894 async information extraction #3908

Merged
merged 63 commits into from
Nov 9, 2021
63 commits
efa4222
WIP
gabriel-piles Sep 1, 2021
1564051
initial test for the taskManager
konzz Sep 1, 2021
a7a52c9
RedisServer
konzz Sep 2, 2021
d13475a
TaskManager WIP
konzz Sep 4, 2021
f7af6f4
Taskmanager tests with external dummy server WIP
konzz Sep 6, 2021
bd6c1a8
Task manager simple version working
gabriel-piles Sep 7, 2021
563ed50
refactor the Repeater to be a class and don't share the stop variable
konzz Sep 8, 2021
7c5ca67
TaskManager using Repeater to listen for messages
konzz Sep 8, 2021
8415528
Refactored Repeater specs to work with fakeTimers
konzz Sep 9, 2021
7d8334f
RepeaterWithLock for locking distributed Uwazi tasks
konzz Sep 9, 2021
e65abf4
Repeat with lock handles redis unavailable
gabriel-piles Sep 10, 2021
abc31e1
Handle error in task and add delay between tasks parameter in repeat …
gabriel-piles Sep 13, 2021
5f38b0b
Repeater specs cleanup WIP
konzz Sep 13, 2021
3c3a78b
Clean up tests for repeater with lock
gabriel-piles Sep 14, 2021
54c31b8
Cleanup of console logs and fixed some eslint errors
konzz Sep 14, 2021
ea7e87e
Task manager refactor to avoid factory
gabriel-piles Sep 15, 2021
2d3266b
TaskManager refactor and redis unavailable test
konzz Sep 15, 2021
3adb54b
Handling redis errors in taskmanager
konzz Sep 15, 2021
2adf80c
Segmentator task working for one file
gabriel-piles Sep 16, 2021
baaf6c2
Segment many files
gabriel-piles Sep 16, 2021
83bafcf
Use information extraction for segmenting
gabriel-piles Sep 20, 2021
4fcd276
Multitenant test for pdf segmentation
gabriel-piles Sep 20, 2021
4558d53
exposing the fixturer
konzz Sep 21, 2021
8a6f365
PDF segmentation running multitenant WIP
konzz Sep 21, 2021
b388402
TODOs segmentation
konzz Sep 21, 2021
e36efa5
PDF segmentation storing the segmentation place holder
konzz Oct 5, 2021
13ea822
Moved all files under services folder, and refactor Repeater specs so…
konzz Oct 6, 2021
b27d761
Aggregation to only get files that need to be processed
konzz Oct 7, 2021
e2673ed
Taskmanager count, and some pending tests for the PDFsegmentation
konzz Oct 13, 2021
2bdb11c
PDFsegmentation doing nothing when there is already pending tasks
konzz Oct 13, 2021
f8df201
Segmentator sending the PDFs instead of the TaskManager
konzz Oct 13, 2021
8ed5f58
Handling tenants with no segmentation config
konzz Oct 13, 2021
89a9170
TaskManager only passing the results message to the connector instead…
konzz Oct 14, 2021
efa9a78
Uwazi server using a repeater to Segment PDFs
konzz Oct 15, 2021
3ec154b
small refactor and rename
konzz Oct 18, 2021
a4f556a
fixed some eslint errors and specs
konzz Oct 18, 2021
4ab94f7
fixed some eslint errors
konzz Oct 18, 2021
73543cf
Fixed installation of redis server for tests
konzz Oct 18, 2021
16fd346
removed redis-server
konzz Oct 18, 2021
cb4515d
Some refactor and attempting to fix CI redis installation
konzz Oct 18, 2021
277b4a8
Moved redis installation to a jest pre hook
konzz Oct 18, 2021
2fd8a54
removed unused variable
konzz Oct 18, 2021
52fcb9b
fixed an issue with downloadRedis
konzz Oct 18, 2021
ab41bbe
refactored PDFsegmentation to segment all files regardless of template
konzz Oct 19, 2021
127ba72
Fixed some errors typing
konzz Oct 19, 2021
6692524
General refactor, fixed PDFsegmentation tests database setup
konzz Oct 19, 2021
9a96c3a
deleted unused var
konzz Oct 19, 2021
3825a84
renamed RepeatWith to DistributedLoop
konzz Oct 19, 2021
0b977f7
Changed task structure to contain params
konzz Oct 19, 2021
f260d8b
PDFsegmentation service requesting and storing the xml
konzz Oct 20, 2021
fd5c112
some error handling
konzz Oct 20, 2021
10ecc20
deleted unused var
konzz Oct 20, 2021
866d020
fixed type
konzz Oct 20, 2021
48cce93
Always deleting result message when an error happens
konzz Oct 21, 2021
12e9a9e
only save the error when the segmentation placeholder existed to avoi…
konzz Oct 22, 2021
62ff8cf
Using getFileContents instead of readFileSync
konzz Oct 26, 2021
9187b54
Segmentation type and using filesystem instead of fs
konzz Oct 26, 2021
2bd25d0
Error handling for missing files
konzz Oct 27, 2021
18bb88e
Storing only one process when the segmentation fails
konzz Oct 28, 2021
9469669
adding tenant to service url
konzz Oct 28, 2021
cc84a09
fixed types error
konzz Oct 28, 2021
7b9e3b5
segmentation service behind a config flag
konzz Nov 8, 2021
a272d02
Merge branch 'development' into 3894-async-information-extraction
konzz Nov 9, 2021
3 changes: 3 additions & 0 deletions .gitignore
@@ -37,3 +37,6 @@ custom_uploads/*
test
app/api/files/specs/file1
app/api/files/specs/file2
**/redis-bin
dump.rdb

3 changes: 2 additions & 1 deletion app/api/config.ts
@@ -47,10 +47,11 @@ export const config = {
     customUploads: CUSTOM_UPLOADS_FOLDER || `${rootPath}/custom_uploads/`,
     temporalFiles: TEMPORAL_FILES_FOLDER || `${rootPath}/temporal_files/`,
   },
+  externalServices: Boolean(process.env.EXTERNAL_SERVICES) || false,
 
   redis: {
     activated: CLUSTER_MODE,
     host: process.env.REDIS_HOST || 'localhost',
-    port: process.env.REDIS_PORT || 6379,
+    port: parseInt(process.env.REDIS_PORT || '', 10) || 6379,
   },
 };
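A note on the two environment values read in this hunk: `Boolean(process.env.EXTERNAL_SERVICES)` is true for any non-empty string, including `'false'` and `'0'`, while the new `parseInt(...) || 6379` falls back to the default when the port is missing or not numeric. The sketch below shows one stricter way to parse both. The helpers `parseFlag`, `parsePort` and `buildConfig` are hypothetical names introduced for illustration and are not part of this PR.

```typescript
// Hypothetical helpers, not part of this PR: a stricter way to parse the
// environment values that app/api/config.ts reads.

type Env = Record<string, string | undefined>;

// Boolean('false') === true, so compare against an explicit allow-list instead.
const parseFlag = (value: string | undefined): boolean =>
  ['1', 'true', 'yes', 'on'].includes((value || '').trim().toLowerCase());

// Same fallback idea as `parseInt(process.env.REDIS_PORT || '', 10) || 6379`,
// with an extra check that the value is inside the valid TCP port range.
const parsePort = (value: string | undefined, fallback: number): number => {
  const port = parseInt(value || '', 10);
  return Number.isInteger(port) && port > 0 && port < 65536 ? port : fallback;
};

// Takes the environment as a parameter so it can be exercised without
// touching process.env.
const buildConfig = (env: Env) => ({
  externalServices: parseFlag(env.EXTERNAL_SERVICES),
  redis: {
    host: env.REDIS_HOST || 'localhost',
    port: parsePort(env.REDIS_PORT, 6379),
  },
});
```

In this sketch `buildConfig(process.env)` would reproduce the shape used above, with `EXTERNAL_SERVICES=false` treated as disabled.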
3 changes: 3 additions & 0 deletions app/api/files/filesystem.ts
@@ -136,6 +136,8 @@ const streamToString = async (stream: Readable): Promise<string> =>
const getFileContent = async (fileName: FilePath): Promise<string> =>
  asyncFS.readFile(uploadsPath(fileName), 'utf8');

const readFile = async (fileName: FilePath): Promise<Buffer> => asyncFS.readFile(fileName);

export {
  setupTestUploadedPaths,
  deleteUploadedFiles,
@@ -154,4 +156,5 @@ export {
  activityLogPath,
  writeFile,
  appendFile,
  readFile,
};
198 changes: 198 additions & 0 deletions app/api/services/pdfsegmentation/PDFSegmentation.ts
@@ -0,0 +1,198 @@
import { TaskManager, ResultsMessage } from 'api/services/tasksmanager/TaskManager';
import { uploadsPath, fileFromReadStream, createDirIfNotExists, readFile } from 'api/files';
import { Readable } from 'stream';
import urljoin from 'url-join';
import filesModel from 'api/files/filesModel';
import path from 'path';
import { FileType } from 'shared/types/fileType';
import { Settings } from 'shared/types/settingsType';
import settings from 'api/settings/settings';
import { tenants } from 'api/tenants/tenantContext';
import { ObjectIdSchema } from 'shared/types/commonTypes';
import request from 'shared/JSONRequest';
import { handleError } from 'api/utils';
import { SegmentationModel } from './segmentationModel';

class PDFSegmentation {
  SERVICE_NAME = 'segmentation';

  public segmentationTaskManager: TaskManager;

  templatesWithInformationExtraction: string[] | undefined;

  features: Settings | undefined;

  batchSize = 10;

  constructor() {
    this.segmentationTaskManager = new TaskManager({
      serviceName: this.SERVICE_NAME,
      processResults: this.processResults,
    });
  }

  segmentOnePdf = async (
    file: FileType & { filename: string; _id: ObjectIdSchema },
    serviceUrl: string,
    tenant: string
  ) => {
    try {
      const fileContent = await readFile(uploadsPath(file.filename));
      await request.uploadFile(urljoin(serviceUrl, tenant), file.filename, fileContent);

      await this.segmentationTaskManager.startTask({
        task: this.SERVICE_NAME,
        tenant,
        params: {
          filename: file.filename,
        },
      });

      await this.storeProcess(file._id, file.filename);
    } catch (err) {
      if (err.code === 'ENOENT') {
        await this.storeProcess(file._id, file.filename, false);
        handleError(err);
        return;
      }

      throw err;
    }
  };

  storeProcess = async (fileID: ObjectIdSchema, filename: string, processing = true) =>
    SegmentationModel.save({
      fileID,
      filename,
      status: processing ? 'processing' : 'failed',
    });

  // The parentheses make the return type an array of intersected records;
  // without them `[]` binds only to the object literal type.
  getFilesToSegment = async (): Promise<(FileType & { filename: string; _id: ObjectIdSchema })[]> =>
    filesModel.db.aggregate([
      {
        $match: {
          type: 'document',
          filename: { $exists: true },
        },
      },
      {
        $lookup: {
          from: 'segmentations',
          localField: '_id',
          foreignField: 'fileID',
          as: 'segmentation',
        },
      },
      {
        $match: {
          segmentation: {
            $size: 0,
          },
        },
      },
      {
        $limit: this.batchSize,
      },
    ]);

  segmentPdfs = async () => {
    const pendingTasks = await this.segmentationTaskManager.countPendingTasks();
    if (pendingTasks > 0) {
      return;
    }

    try {
      await Promise.all(
        Object.keys(tenants.tenants).map(async tenant => {
          await tenants.run(async () => {
            const settingsValues = await settings.get();
            const segmentationServiceConfig = settingsValues?.features?.segmentation;

            if (!segmentationServiceConfig) {
              return;
            }

            const filesToSegment = await this.getFilesToSegment();

            for (let i = 0; i < filesToSegment.length; i += 1) {
              // eslint-disable-next-line no-await-in-loop
              await this.segmentOnePdf(filesToSegment[i], segmentationServiceConfig.url, tenant);
            }
          }, tenant);
        })
      );
    } catch (err) {
      if (err.code === 'ECONNREFUSED') {
        await new Promise(resolve => {
          setTimeout(resolve, 60000);
        });
      }
      handleError(err, { useContext: false });
    }
  };

  requestResults = async (message: ResultsMessage) => {
    const response = await request.get(message.data_url);
    const fileStream = (await fetch(message.file_url!)).body;

    if (!fileStream) {
      throw new Error(
        `Error requesting for segmentation file: ${message.params!.filename}, tenant: ${
          message.tenant
        }`
      );
    }
    return { data: JSON.parse(response.json), fileStream: (fileStream as unknown) as Readable };
  };

  storeXML = async (filename: string, fileStream: Readable) => {
    const folderPath = uploadsPath(this.SERVICE_NAME);
    await createDirIfNotExists(folderPath);
    const xmlname = `${path.basename(filename, path.extname(filename))}.xml`;

    await fileFromReadStream(xmlname, fileStream, folderPath);
  };

  saveSegmentation = async (filename: string, data: any) => {
    const [segmentation] = await SegmentationModel.get({ filename });
    // eslint-disable-next-line camelcase
    const { paragraphs, page_height, page_width } = data;
    await SegmentationModel.save({
      ...segmentation,
      segmentation: { page_height, page_width, paragraphs },
      autoexpire: null,
      status: 'ready',
    });
  };

  saveSegmentationError = async (filename: string) => {
    const [segmentation] = await SegmentationModel.get({ filename });
    if (segmentation) {
      await SegmentationModel.save({
        ...segmentation,
        filename,
        autoexpire: null,
        status: 'failed',
      });
    }
  };

  processResults = async (message: ResultsMessage): Promise<void> => {
    await tenants.run(async () => {
      try {
        if (!message.success) {
          await this.saveSegmentationError(message.params!.filename);
          return;
        }

        const { data, fileStream } = await this.requestResults(message);
        await this.storeXML(message.params!.filename, fileStream);
        await this.saveSegmentation(message.params!.filename, data);
      } catch (error) {
        handleError(error);
      }
    }, message.tenant);
  };
}

export { PDFSegmentation };
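The aggregation in `getFilesToSegment` selects documents that have no matching record in the `segmentations` collection, capped at `batchSize`. The sketch below restates that selection over plain arrays to make the logic easier to follow. It is a simplified model: `FileRecord`, `SegmentationRecord` and `filesToSegment` are illustrative names with assumed shapes, not the real Uwazi types or API.

```typescript
// In-memory sketch of the selection performed by getFilesToSegment.
// The record shapes are simplified assumptions, not the real Uwazi types.

interface FileRecord {
  _id: string;
  type: string;
  filename?: string;
}

interface SegmentationRecord {
  fileID: string;
  status: 'processing' | 'failed' | 'ready';
}

const filesToSegment = (
  files: FileRecord[],
  segmentations: SegmentationRecord[],
  batchSize = 10
): FileRecord[] => {
  // Stands in for $lookup followed by $match { segmentation: { $size: 0 } }:
  // a file qualifies only if no segmentation record points at it.
  const segmented = new Set(segmentations.map(s => s.fileID));

  return files
    .filter(f => f.type === 'document' && f.filename !== undefined) // first $match
    .filter(f => !segmented.has(f._id))
    .slice(0, batchSize); // $limit
};
```

Because the match only checks that a record exists, a file whose segmentation is marked `failed` is not picked up again while that record remains in the collection.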
18 changes: 18 additions & 0 deletions app/api/services/pdfsegmentation/segmentationModel.ts
@@ -0,0 +1,18 @@
import mongoose from 'mongoose';
import { instanceModel } from 'api/odm';
import { SegmentationType } from 'shared/types/segmentationType';

const props = {
  autoexpire: { type: Date, expires: 86400, default: Date.now }, // 24 hours
  file: { type: mongoose.Schema.Types.ObjectId, ref: 'File' },
  status: { type: String, enum: ['processing', 'failed', 'ready'], default: 'processing' },
};

const mongoSchema = new mongoose.Schema(props, {
  emitIndexErrors: true,
  strict: false,
});

const SegmentationModel = instanceModel<SegmentationType>('segmentations', mongoSchema);

export { SegmentationModel };
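The `expires` option on `autoexpire` asks Mongoose to create a MongoDB TTL index, and TTL indexes skip documents whose indexed field is not a date. That appears to be why `saveSegmentation` and `saveSegmentationError` write `autoexpire: null`: stored results are kept, while placeholders that keep the default `Date.now` value are removed after roughly 24 hours. The sketch below models that rule in memory. `Expirable` and `isExpired` are illustrative names, and the function describes the documented behaviour rather than MongoDB's implementation.

```typescript
// Model of the TTL rule the `autoexpire` field relies on.
// Illustrative only: MongoDB applies this rule in a background task.

const TTL_SECONDS = 86400; // same value as `expires` in the schema

interface Expirable {
  autoexpire: Date | null;
}

// A document is eligible for removal only when the indexed field holds a
// date that is at least TTL_SECONDS old; a null value never expires.
const isExpired = (doc: Expirable, now: Date): boolean =>
  doc.autoexpire instanceof Date &&
  now.getTime() - doc.autoexpire.getTime() >= TTL_SECONDS * 1000;
```

MongoDB's TTL task runs periodically, so real deletions can lag behind the moment a document becomes eligible.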