Link to home
Start Free TrialLog in
Avatar of VIVEKANANDHAN_PERIASAMY
VIVEKANANDHAN_PERIASAMY

asked on

SQL Monitoring job not allowing the job to start again

SQL_Job_Monitor JOB monitors the jobs which are running morethan 30 mins and will stop and start the job again.
Problem is ,after starting the SQL job,it again stopping it.I don't to stop it ,SQL_Job_Monitor JOB monitors job shouldn't disturb it again.

I have sql agent which will call this store proc
USE [userdatabase]
GO
/****** Object:  StoredProcedure [dbo].[test]    Script Date: 07/04/2012 16:50:44 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER procedure [dbo].[test] 
as 
begin 
waitfor delay '00:35:00'
end

Open in new window


Exec msdb.dbo.SQL_Job_Monitor 30, 0,'<ExcludeJob><JobName JOB="Export TMF PerfData"/></ExcludeJob>'
USE [msdb]
GO
/****** Object:  StoredProcedure [dbo].[SQL_Job_Monitor]    Script Date: 07/04/2012 16:46:13 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[SQL_Job_Monitor] (
          @threshold_minute int = 30,       -- Default is set to 60 minutes
          @StopJob bit = 0,                  -- Default is not to stop the job 
          @ExcludeJOb XML =NULL -- '<ExcludeJob><JobName JOB="SQL SERVER JOB"/></ExcludeJob>'
          )
 
AS
 
SET NOCOUNT ON
 
DECLARE @is_sysadmin     INT
DECLARE @job_owner       SYSNAME
DECLARE @job_id         UNIQUEIDENTIFIER
DECLARE @JobName         VARCHAR(255)
DECLARE @JobStartTime    DateTime
DECLARE @JobRunningTime bigINT
DECLARE @Msg            VARCHAR(1000)
DECLARE @ExcludeJObTable TABLE (JobName VARCHAR(1000))
DECLARE @Body varchar(max)
DECLARE	@TableHead varchar(max)
DECLARE	@Table_Content varchar(max)
DECLARE @TableTail varchar(max)
DECLARE @mail_subject VARCHAR(255)
 
/** Insert the jobs to be excluded from job monitoring  */
INSERT @ExcludeJObTable (JobName)
SELECT xmllist.rows.value('@JOB','varchar(1000)') 
from @ExcludeJOb.nodes('/ExcludeJob/JobName') xmllist(rows)

-- Step 1: Create intermediate work tables to store the running job data
CREATE TABLE #xp_results (job_id                UNIQUEIDENTIFIER NOT NULL,
                       last_run_date         INT              NOT NULL,
                       last_run_time         INT              NOT NULL,
                       next_run_date         INT              NOT NULL,
                       next_run_time         INT              NOT NULL,
                       next_run_schedule_id  INT              NOT NULL,
                       requested_to_run      INT              NOT NULL, -- BOOL
                       request_source        INT              NOT NULL,
                       request_source_id     sysname          NULL,
                       running               INT              NOT NULL, -- BOOL
                       current_step          INT              NOT NULL,
                       current_retry_attempt INT              NOT NULL,
                       job_state             INT              NOT NULL)

CREATE TABLE #xp_results_new (job_id                UNIQUEIDENTIFIER NOT NULL,
                       last_run_date         INT              NOT NULL,
                       last_run_time         INT              NOT NULL,
                       next_run_date         INT              NOT NULL,
                       next_run_time         INT              NOT NULL,
                       next_run_schedule_id  INT              NOT NULL,
                       requested_to_run      INT              NOT NULL, -- BOOL
                       request_source        INT              NOT NULL,
                       request_source_id     sysname          NULL,
                       running               INT              NOT NULL, -- BOOL
                       current_step          INT              NOT NULL,
                       current_retry_attempt INT              NOT NULL,
                       job_state             INT              NOT NULL)
 
-- Step 2: Capture job execution information (for local jobs only since that's all SQLServerAgent caches)
SELECT @is_sysadmin = ISNULL(IS_SRVROLEMEMBER(N'sysadmin'), 0)
SELECT @job_owner = SUSER_SNAME() 
IF ((@@microsoftversion / 0x01000000) >= 8) -- SQL Server 8.0 or greater
BEGIN
    INSERT INTO #xp_results
    EXECUTE master.dbo.xp_sqlagent_enum_jobs @is_sysadmin, @job_owner, @job_id
END
ELSE
BEGIN
     INSERT INTO #xp_results
     EXECUTE master.dbo.xp_sqlagent_enum_jobs @is_sysadmin, @job_owner
END
 
-- Step 3: Get the current job that exceeded the monitoring threshold
 
DECLARE joblist_cursor CURSOR FOR
SELECT j.name, a.job_id, a.start_execution_date, DATEDIFF(MINUTE,a.start_execution_date,GETDATE()) AS JobRunningTime
FROM #xp_results x
INNER JOIN msdb.dbo.sysjobactivity a
ON x.job_id = a.job_id
INNER join msdb.dbo.sysjobs j
ON a.job_id = j.job_id
     WHERE x.running = 1                                               -- The job is running
     AND j.name  not IN  (SELECT Jobname from @ExcludeJObTable)               --Excluding the job Monitoring
     AND a.session_id = (select max(session_id) from msdb.dbo.sysjobactivity where job_id = a.job_id)
     AND a.stop_execution_date is null                                 -- The job is not complete 
--     AND a.start_execution_date <= dateadd(minute, -@threshold_minute, getdate())     -- Measure the thresdhold
 
OPEN joblist_cursor
 
FETCH NEXT FROM joblist_cursor
INTO @JobName, @job_id, @JobStartTime, @JobRunningTime
 
WHILE @@FETCH_STATUS = 0
BEGIN
     -- Check again if the same job at the starting time is still active
     IF EXISTS (SELECT TOP 1 * FROM msdb.dbo.sysjobactivity WHERE job_id = @job_ID AND stop_execution_date is NULL AND  
     @JobRunningTime>@threshold_minute)
     BEGIN
		
			PRINT 'Stopping job [' + @Jobname + ']'
			EXEC msdb..sp_stop_job @job_name=@JobName
			DECLARE @i INT
			SET @i=1
			INSERT INTO #xp_results_new
			EXECUTE master.dbo.xp_sqlagent_enum_jobs @is_sysadmin, @job_owner, @job_id			
			WHILE EXISTS (SELECT running
			FROM #xp_results_new x
			JOIN msdb.dbo.sysjobs sj ON x.job_id=sj.job_id
			WHERE running=1 OR job_state<>4)
			BEGIN
				IF @i>99
				BEGIN
					SELECT @Msg='The SQL Job - [' + @JobName + '] on ' + @@ServerName + ' has exceeded the job running threshold of ' + CONVERT(varchar(3), @threshold_minute) + ' minutes.'+
					' Job doesn''t stop in timely fashion. Number of trials to stop it: '+CAST(@i AS CHAR(3))
					RAISERROR (@Msg, 16, 1) WITH LOG
					GOTO Failed
				END
				TRUNCATE TABLE #xp_results_new
				PRINT 'It seems that job '+@JobName+' is still running.'
				PRINT 'Attempting to stop it again ... (trial no '+CAST(@i AS CHAR(2))+')'
				EXEC msdb..sp_stop_job @job_name=@JobName
				WAITFOR DELAY '00:00:10'
				INSERT INTO #xp_results_new
				EXECUTE master.dbo.xp_sqlagent_enum_jobs @is_sysadmin, @job_owner, @job_id
				SET @i=@i+1
			END
			TRUNCATE TABLE #xp_results_new
			PRINT 'Starting job [' + @Jobname + ']'
			EXEC msdb..sp_start_job @job_name=@JobName
			WAITFOR DELAY '00:00:10'
			
			INSERT INTO #xp_results_new
			EXECUTE master.dbo.xp_sqlagent_enum_jobs @is_sysadmin, @job_owner, @job_id
			
			IF EXISTS (SELECT 1 FROM #xp_results_new)
				BEGIN
					INSERT INTO [msdb].[dbo].[jobs_stats] (job_id, job_name, job_status, current_step, current_retry_attempt, job_state, entry_date, msg)
					SELECT x.job_id,
					sj.name as job_name,
					CASE running
						WHEN 0 THEN 'Stopped'
						WHEN 1 THEN 'Running'
					END AS job_status,
					current_step ,
					current_retry_attempt ,
					CASE job_state
						WHEN 0 THEN 'Not idle or suspended'
						WHEN 1 THEN 'Executing'
						WHEN 2 THEN 'Waiting For Thread'
						WHEN 3 THEN 'Between Retries'
						WHEN 4 THEN 'Idle'
						WHEN 5 THEN 'Suspended'
						WHEN 6 THEN 'Waiting For Step To Finish'
						WHEN 7 THEN 'Performing Completion Actions'
					END AS job_state,
					GETDATE() AS entry_date,
					'The SQL Job - [' + @JobName + '] on ' + @@ServerName + ' was stopped and started successfully with '+CAST(@i AS CHAR(2))+' stop trial' AS msg
					FROM #xp_results_new x
					JOIN msdb.dbo.sysjobs sj ON x.job_id=sj.job_id
					TRUNCATE TABLE #xp_results_new;
				END      

			IF EXISTS (SELECT TOP 1 * FROM msdb.dbo.sysjobactivity WHERE job_id = @job_ID AND stop_execution_date is NULL AND  
		    DATEDIFF(MINUTE,start_execution_date,GETDATE())>@threshold_minute)
		 
			BEGIN
				SELECT @Msg = 'The SQL Job - [' + @JobName + '] on ' + @@ServerName + ' has exceeded the job running threshold of ' + CONVERT(varchar(3), @threshold_minute) + ' minutes.'
				RAISERROR (@Msg, 16, 1) WITH LOG
				
				INSERT INTO #xp_results_new
				EXECUTE master.dbo.xp_sqlagent_enum_jobs @is_sysadmin, @job_owner, @job_id
				
				Failed:
				IF EXISTS (SELECT 1 FROM #xp_results_new)
				BEGIN
					INSERT INTO [msdb].[dbo].[jobs_stats] (job_id, job_name, job_status, current_step, current_retry_attempt, job_state, entry_date, msg)
					SELECT x.job_id,
					sj.name as job_name,
					CASE running
						WHEN 0 THEN 'Stopped'
						WHEN 1 THEN 'Running'
					END AS job_status,
					current_step ,
					current_retry_attempt ,
					CASE job_state
						WHEN 0 THEN 'Not idle or suspended'
						WHEN 1 THEN 'Executing'
						WHEN 2 THEN 'Waiting For Thread'
						WHEN 3 THEN 'Between Retries'
						WHEN 4 THEN 'Idle'
						WHEN 5 THEN 'Suspended'
						WHEN 6 THEN 'Waiting For Step To Finish'
						WHEN 7 THEN 'Performing Completion Actions'
					END AS job_state,
					GETDATE() AS entry_date,
					@Msg AS msg
					FROM #xp_results_new x
					JOIN msdb.dbo.sysjobs sj ON x.job_id=sj.job_id
					TRUNCATE TABLE #xp_results_new;
				/*IF @StopJob = 1
					BEGIN
						PRINT 'Stopping job [' + @Jobname + ']'
						EXEC msdb..sp_stop_job @job_name=@JobName
					END*/
				
				END
			END
     END
	
	
	 ELSE
	
	 
     FETCH NEXT FROM joblist_cursor
     INTO @JobName, @job_id, @JobStartTime, @JobRunningTime

END
 
CLOSE joblist_cursor
DEALLOCATE joblist_cursor
DROP TABLE #xp_results
DROP TABLE #xp_results_new

Open in new window

Avatar of keyu
keyu
Flag of India image

there must be some inner exception generated due to which its getting closed or failed

either you need to print proper error message at each step so you can get actual error iin job decription or need to debug manuaally your script using query analyzer...
<<Problem is ,after starting the SQL job,it again stopping it.I don't to stop it ,SQL_Job_Monitor JOB monitors job shouldn't disturb it again.>>
I read this as:
My process works fine for stopping jobs that run to long.
I automaticly restart the job in my process and that restart must be excluded from the monitoring for long running jobs... but it doesn't.

Looks normal because the name won't be in the exclude list of the next run of your monitoring job. You must make a permanent table for the excluded jobs. Permanent excluded jobs must be in there and temporary excluded jobs must be inserted there with a datetime until they must be exluded from monitoring. Clean that record when expired.
Avatar of VIVEKANANDHAN_PERIASAMY
VIVEKANANDHAN_PERIASAMY

ASKER

<<I automaticly restart the job in my process and that restart must be excluded from the monitoring for long running jobs... but  it doesn't>>

problem here is, after restarting the job, in a same cycle it stops the started job.
again even though job is running less mentioned threshold.

when it restarts, will it take the previous job time. .Probably it won't.
Just guessing.

If you run thescript on your SQL server, you will get my issue.

It should start the job and leave the job as it is.. in the nexr cycle if the same job is running morethan mentioned threshold then i shoud stop it but in the same cycle it shouldn't stop after starting the job.
Isn't it the problem (in real life job) that the canceled job is not gone yet because it's still doing it's rollback and that that can take nearly as long as it was processing in first place.


<<It should start the job and leave the job as it is..>>
Things that can interfere
- Only a fetch next in the ELSE-block
- mixed working on both job_id and name
- cursor on tables that are in modification, you even interfere in your cursor (stopping and starting) -> make it static
- GOTO Failed and your Failed:-label is somewhere in a different BEGIN/END-block??????
And could be handy to post output from job that has canceled the restarted job... you put print-commands in it to make execution tracable.... but don't share it when you have a problem.

And repeat what I already mentioned in earlier topics. Just stopping a job on duration with a exclusion-list by name and restart it is not something I would have much confidence in it will get my server more reliable.
More confidence in a system that works for jobsnames you give explicit ( with like '%%')  or when jobs are blocked for a longer time while that is what i think you want to achieve. But test on blocking jobs then.
And better try to fix repeating blocking processes to change the processes.

So if you run this job every 15 minutes and you start a job on friday late afternoon that for one time (end quarter or year, onetime large processing ...) then it could be processing a whole night or even weekend without ending because it's restarted because it's duration is 40 minutes.
<<Isn't it the problem (in real life job) that the canceled job is not gone yet because it's still doing it's rollback and that that can take nearly as long as it was processing in first place.>>
nope, i will exclude the importants job in the exclude list.
These are jobs for deleting the older datas.It's acceptable in our enviroment.

Only a fetch next in the ELSE-block
- mixed working on both job_id and name
- cursor on tables that are in modification, you even interfere in your cursor (stopping and starting) ->
how do it correct it? Your help is appreciated.

Output of the jobs:
entry_id      job_id      job_name      job_status      current_step      current_retry_attempt      job_state      entry_date      msg
53      2F9EBE7F-E668-46D8-ACAB-730CC57ED3CE      TEST      Running      1      0      Executing      2012-07-04 21:45:20.360      The SQL Job - [TEST] on servername was stopped and started successfully with 2  stop trial
<<nope, i will exclude the importants job in the exclude list.
...
These are jobs for deleting the older datas.It's acceptable in our enviroment.
>>
You mean you will exclude the jobs that exists now and you now kow that are important.
For all future jobs that will hit the limit .... you never are sure it are only delete jobs and that realy.
And if someone has a startcommand with an old excludelist ...

<<- cursor on tables that are in modification, you even interfere in your cursor (stopping and starting) ->
how do it correct it? Your help is appreciated.>>
Sorry, only thing you didn't copy was the solution 'make it a static cursor'
How to
http://msdn.microsoft.com/en-us/library/ms180169.aspx

But you also have a WAITFOR DELAY '00:00:10' in your code in a 99-loop. If you start in that loop.... how many jobs from your cursor will already be gone when you realy start to treat it?

<<Only a fetch next in the ELSE-block>>
In your if-block you don't have a fecht so you start loop all over with values from previous iteration
=> simple see your if block also ends whith the same fetch. Mostly you write that fetch after your if/else so you only have to write it once.  Here because nothing is coded in the else-block you just can delete the word ELSE.

Performance
- always creating #table and filling it with same values of exclude-list ... while it won't change every week I think => unneeded i/o
- cursor you seem to select all jobs, not only with >30 minute why doing this in the cursor
- many usage of #tables
   -> maybe good to look on how 'heavy' this monitoring job is on IO

PS: was this the job you hired someone for to create it?
Still want to learn from you why stopping a long running job and restarting it directly can be benificial for your system.

And what do you mean by stopping a job and restarting it? Have your jobs multiple steps or not ...

Maybe good to read the remarks-topic from the sp_stop_job http://msdn.microsoft.com/en-us/library/ms182793.aspx
>>Still want to learn from you why stopping a long running job and restarting it directly can be benificial for your system. <<

ideally these jobs should complete in 20 mins,so if running for a long time, we normally stop and start it again. Which will then work absolutely fine.So we are trying to automate the process.

Since I haven't use much of cursor programs,I need help in modification script.
SOLUTION
Avatar of jogos
jogos
Flag of Belgium image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I tried investigating and found, the jobs are getting hung at time's and stated as running.So when we restart the job it works absolutely fine. And it take only 1 mins to completed.
Avatar of nishant joshi
can you able tell us what the exact messages are display in your jobs history.?
so that we can find exact error and solved correctly..

Regards,
Nishant
<<I tried investigating and found, the jobs are getting hung at time's and stated as running.So when we restart the job it works absolutely fine. And it take only 1 mins to completed.>>
What's new in this message?  

Many suggestions, links to tutorial ... you don't give impression you actualy take them for granted. You keep repeating the mantra 'we want to stop and restart and that's ok' and  'can you do this for me'.
i tried but unable to find the way.It could be nice , if you help me out in sorting out the code for time being.later i will plan for permanent fix. It's needs to done on - priority
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
<<i tried but unable to find the way.It could be nice , if you help me out in sorting out the code for time being.later i will plan for permanent fix. It's needs to done on - priority>>
Hire some expert for a day or two to get this job done .. but mainly to see why the jobs are hanging and if that can be fixed.  And see that you have thought about the questions I suggested higher (#a38174331) so he won't be loosing time by asking the questions a professional should ask before starting.

Again we can suggest improvements on your code but you cannot expect us to rewrite your procedure for free.

If you want to do it yourself with our suggestions maybe good not to continue on this to poluted version a38165087but start more from a basic version and build your requirements (ex exclude list) again in it.  
Take an older version or a basic version
http://www.sqlservercentral.com/Forums/Topic1167821-1550-1.aspx#bm1169007