V.1
Participants: | |
---|---|
Solution Architect | |
Product Owner | |
Java Lead | |
Spike Purpose:
Problem:
The current data-import implementation provides no automated way to determine whether a job is "stuck", and no way to notify the responsible people when that happens.
Identifying such situations is currently a manual task.
Goal:
Provide a solution that automatically monitors job executions, detects those whose progress has stopped (stuck jobs), and notifies users about the situation.
Phase 1 JobExecution Monitoring
Database table structure:
The proposed solution is to create a separate table, tentatively named "job_monitoring", that accumulates information about jobExecutions and has the following structure:
job_execution_id | last_event_timestamp | notification_sent |
---|---|---|
5a289466-8ae4-446e-9718-74c21845cee3 | 2021-04-29T18:12:15.607+0000 | false |
49d97e22-61ae-452b-8497-5ff6d68ba9f4 | 2021-04-29T18:07:12.607+0000 | true |
where
job_execution_id - UUID identifier of the job
last_event_timestamp - timestamp of the last update to the jobExecution
notification_sent - boolean value that indicates whether a notification was sent
The example above is based on the following sample assumptions:
current date - 2021-04-29T18:14:17.607+0000
time unit to designate a stuck execution - 5 min
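The check implied by these sample values can be sketched as follows. This is only an illustration: the class and method names (`StuckCheck`, `isStuck`) are hypothetical and not part of data-import, and the timestamp pattern is assumed from the sample rows above.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not an existing data-import class.
final class StuckCheck {
    // Assumed format, taken from the sample rows: 2021-04-29T18:12:15.607+0000
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    // A job counts as stuck when its last event is older than the threshold.
    static boolean isStuck(String lastEventTimestamp, String now, Duration threshold) {
        OffsetDateTime last = OffsetDateTime.parse(lastEventTimestamp, FORMAT);
        OffsetDateTime current = OffsetDateTime.parse(now, FORMAT);
        return Duration.between(last, current).compareTo(threshold) > 0;
    }
}
```

With a 5 min threshold and the sample current date, the first row (last event about 2 minutes ago) is not stuck, while the second row (last event about 7 minutes ago) is, which matches the notification_sent values in the table.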
Database table maintenance:
The "job_monitoring" table will be populated on the following actions:
INSERT:
- on initialization of a JobExecution, in the initializeJobExecutionProgress method of JobExecutionProgressDaoImpl.
UPDATE:
- on any update of job progress, in the updateByJobExecutionId method of JobExecutionProgressDaoImpl.
The update also needs to reset the "notification_sent" flag in the database to allow future notifications for the given job. This also allows us to send a 'recover' notification when the job is alive again (that is, when we update an entry whose flag is set).
DELETE:
- once the jobExecution is finished (i.e. its status is COMPLETED or ERROR), the associated row should be deleted from the monitoring table. In data-import this is equivalent to the condition: total number of chunks for the jobExecution = getCurrentlySucceeded + getCurrentlyFailed. The updateJobExecutionIfAllRecordsProcessed method of the RecordProcessedEventHandlingServiceImpl class serves this purpose.
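The row lifecycle described above might look roughly like the in-memory sketch below. The `JobMonitoringStore` class and its method names are hypothetical and stand in for the real "job_monitoring" table and DAO code; the sketch only illustrates the INSERT / UPDATE (with flag reset) / DELETE rules.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory stand-in for the "job_monitoring" table.
final class JobMonitoringStore {
    static final class Row {
        final Instant lastEventTimestamp;
        final boolean notificationSent;

        Row(Instant lastEventTimestamp, boolean notificationSent) {
            this.lastEventTimestamp = lastEventTimestamp;
            this.notificationSent = notificationSent;
        }
    }

    private final Map<UUID, Row> rows = new HashMap<>();

    // INSERT: called when job execution progress is initialized.
    void insert(UUID jobExecutionId, Instant now) {
        rows.put(jobExecutionId, new Row(now, false));
    }

    // UPDATE: refreshes the timestamp and resets the flag.
    // Returns true if a notification had been sent, i.e. a 'recover' notification is due.
    boolean update(UUID jobExecutionId, Instant now) {
        Row previous = rows.get(jobExecutionId);
        if (previous == null) {
            return false; // unknown job, nothing to update
        }
        rows.put(jobExecutionId, new Row(now, false));
        return previous.notificationSent;
    }

    // Called by the monitoring job after it has reported a stuck execution.
    void markNotified(UUID jobExecutionId) {
        rows.computeIfPresent(jobExecutionId,
                (id, row) -> new Row(row.lastEventTimestamp, true));
    }

    // DELETE: called once the job execution is finished (COMPLETED or ERROR).
    void delete(UUID jobExecutionId) {
        rows.remove(jobExecutionId);
    }

    boolean contains(UUID jobExecutionId) {
        return rows.containsKey(jobExecutionId);
    }
}
```

Returning the previous flag value from `update` is one possible way to let the caller decide whether to emit a 'recover' notification; the actual implementation may choose a different mechanism.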
JobExecution monitoring
The main idea of monitoring is to have a job/timer/task inside the system that watches for specific actions. A watchdog timer is proposed for this purpose.
When the system detects that a job has stopped, it writes this information (including job_execution_id, and possibly more parameters, depending on the requested notification template) to the log, using a fixed predefined pattern and log level.
For example: log level = ERROR and message "Data Import Job with jobExecutionId = %job_execution_id% not progressing".
After this step, the monitoring job sets the "notification_sent" boolean flag in the database to "true".
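A single watchdog pass over the monitored jobs could be sketched as below. The `JobWatchdog` class is hypothetical, keeps its state in memory instead of the "job_monitoring" table, and returns the messages rather than logging them, so the choice of logging framework is left open.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical watchdog sketch; in-memory collections stand in for the database table.
final class JobWatchdog {
    static final String MESSAGE_PATTERN =
            "Data Import Job with jobExecutionId = %s not progressing";

    private final Map<UUID, Instant> lastEventTimestamps = new HashMap<>();
    private final Set<UUID> notified = new HashSet<>(); // plays the role of notification_sent
    private final Duration threshold;

    JobWatchdog(Duration threshold) {
        this.threshold = threshold;
    }

    // Records progress for a job and resets its notification flag.
    void track(UUID jobExecutionId, Instant lastEventTimestamp) {
        lastEventTimestamps.put(jobExecutionId, lastEventTimestamp);
        notified.remove(jobExecutionId);
    }

    // One watchdog pass: returns a message for every newly detected stuck job
    // and marks that job as notified so it is not reported again.
    List<String> check(Instant now) {
        List<String> messages = new ArrayList<>();
        for (Map.Entry<UUID, Instant> entry : lastEventTimestamps.entrySet()) {
            boolean stale = Duration.between(entry.getValue(), now).compareTo(threshold) > 0;
            if (stale && notified.add(entry.getKey())) {
                messages.add(String.format(MESSAGE_PATTERN, entry.getKey()));
            }
        }
        return messages;
    }
}
```

In a real service, `check` could be invoked periodically, for example from a `java.util.concurrent.ScheduledExecutorService`, with each returned message written to the log at ERROR level.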
The environment (AWS, Kibana, or whatever is used to monitor the installation) is set up to track those messages in the log. When such a message is detected, it is split into tokens (e.g. jobExecutionId) and an alert is generated.
The monitoring system should have a recipient or recipient group configured to receive an email for this alert.
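How the external tool extracts tokens depends on that tool, but the idea can be illustrated with a plain regular expression over the proposed message. The `StuckJobLogParser` class is hypothetical, and the pattern assumes that the job execution ID is logged as a standard 36-character UUID string.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser; real monitoring tools would use their own filter syntax.
final class StuckJobLogParser {
    // Assumes the ID appears as a 36-character UUID string in the log message.
    private static final Pattern STUCK_JOB = Pattern.compile(
            "Data Import Job with jobExecutionId = ([0-9a-fA-F-]{36}) not progressing");

    // Returns the jobExecutionId token if the log line reports a stuck job.
    static Optional<String> extractJobExecutionId(String logLine) {
        Matcher matcher = STUCK_JOB.matcher(logLine);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

Keeping the message pattern fixed is what makes this kind of matching reliable, so any change to the wording would need to be coordinated with the alert configuration.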
Requirements:
- configurable start time - TBD time unit
Question | Answer |
---|---|
Time unit to monitor? | for v.1 the default time is 20 min. |
Log message | level: ERROR; message: "Data Import Job with jobExecutionId = %job_execution_id% not progressing" |
Notification channel | for v.1 it depends on the customer's external monitoring tool (AWS, Kibana, etc.) |
Task List | Jira | high-level estimation (story points) |
---|---|---|
| 5 | |
| 5 | |
| 3 | |
| 8 | |
For v.1 of "Spike: Monitoring for data-import" it was decided that the monitoring (sending emails, maintaining the receivers list) will be covered by external tools (AWS, Kibana, etc.), depending on the customer setup.
Phase 2 Receivers list
Requirements:
- E-mail list (configurable)
Question | Answer |
---|---|
How will the receivers list be populated? | |
Task List | Jira | high level estimation(story points) |
---|---|---|
Phase 3 Sending emails
Requirements: Ann-Marie Breaux to finalize requirements with librarians and add them here, along with a sample e-mail mockup
Message should include
- Message date and time stamp (localized date and time)
- Job number
- User who started the job
- File name
- Job profile
- Job start date and time
- Job stop date and time (if it stopped)
Question | Answer |
---|---|
Sample mockup for email? | |
Task List | Jira | high level estimation(story points) |
---|---|---|
Configuration:
- No UI planned at this time
- Vladimir Shalaev is documenting configuration details for implementing in various environments:
- AWS
- Kubernetes
- On prem
- Next steps: Will review with DevOps and then link here
- Should this be directed at the hosting provider only (as a first iteration)? Still TBD; it will take additional work if customized to individual tenants
Improvements:
- per Oleksii Kuzminov - different levels of WARN messages might be considered in the next implementation version
- per Vladimir Shalaev - stuck-job analyzer improvement - if none of the other jobs has been updated within the predefined unit of time either, this might be considered a sign that some module is about to restart.
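The analyzer improvement in the last item could, under the assumption that the last event timestamps of all monitored jobs are available, be sketched like this. The `StuckJobAnalyzer` class and `allJobsStale` method are hypothetical names used only for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Collection;

// Hypothetical analyzer sketch for the proposed improvement.
final class StuckJobAnalyzer {
    // True when at least one job is monitored and none of them has been
    // updated within the threshold, which may point to a module-level problem
    // rather than to individual stuck jobs.
    static boolean allJobsStale(Collection<Instant> lastEventTimestamps,
                                Instant now, Duration threshold) {
        return !lastEventTimestamps.isEmpty()
                && lastEventTimestamps.stream()
                        .allMatch(ts -> Duration.between(ts, now).compareTo(threshold) > 0);
    }
}
```

When this check returns true, the monitoring job could log a single module-level message instead of one message per job; whether that is desirable is still open.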