Getting the number of CPU cycles (commands per second) a process is using

Hello experts,

Does anyone know how to get the number of CPU cycles a certain process is using in Windows XP SP2? I am looking for a result in commands per second. I don't mind what programming language or application is used (C, C++, C#, etc.).

I have tried looking through MSDN, Task Manager (Ctrl+Alt+Delete), and an application entitled "Process Explorer", but I couldn't find what I was looking for.

I'm not sure if this kind of question is too low level for Windows. I would appreciate it if someone could let me know whether getting the number of commands per second for an application is possible in Windows.

I'm not looking for an answer in terms of the percentage of CPU used up, unless there is no other or more accurate answer. That said, would it be possible to approximate the number of cycles per second using the CPU percentage of a process? That is, am I making any invalid assumptions with this math:

commands per second = CPU Percentage * (# of commands per second, max)

Processors are often advertised with a frequency such as 2.8 GHz. I'm not sure exactly what this means, but would it mean commands per second? That is, could I simply do this, for example:

commands per second = 30% * (2,800,000,000 commands / second)

adg080898 Commented:
It is easy to get the exact number of cycles taken. x86 processors have an instruction called RDTSC (read time-stamp counter). It reads a 64-bit register which is incremented every clock tick (meaning, on a 2 GHz processor, it will increase by 2,000,000,000 per second).

Here is a function to read the timestamp counter:

__int64 RDTSC()
{
  __asm rdtsc
}

(You will get a warning like "must return a value". You can ignore it; the compiler doesn't see that the assembly instruction leaves the result in EDX:EAX, which is exactly where an __int64 return value goes.)


If you don't need extreme precision, you can use QueryPerformanceCounter and QueryPerformanceFrequency. These usually offer sub-microsecond resolution. (The actual frequency depends on the type of computer you have; old systems use a 1.19 MHz timer, giving a resolution of about 0.84 microseconds.)

It is fairly simple, you call QueryPerformanceFrequency to get the frequency. Then, you call QueryPerformanceCounter before and after the operation to be timed.

void TimeMe()
{
  // work to be timed....
}

void TimeIt()
{
  LARGE_INTEGER liFreq, liStart, liEnd, liElap;

  QueryPerformanceFrequency(&liFreq);

  QueryPerformanceCounter(&liStart);
  TimeMe();
  QueryPerformanceCounter(&liEnd);

  liElap.QuadPart = liEnd.QuadPart - liStart.QuadPart;
  // Get nano (billionth) of a second accuracy
  liElap.QuadPart *= 1000000000;
  liElap.QuadPart /= liFreq.QuadPart;

  printf("Time was %I64d nanoseconds\n", liElap.QuadPart);
}

I'm away from my development machine so I can't compile and test the code above, but I have done this hundreds of times, it should be right. :)

>> Processors are often advertised with a frequency such as 2.8 Ghz. I'm not sure what this exactly means, but would it mean commands per second? That is, could I simply do this, for example:

That is the number of clock cycles the CPU runs per second.
Marty543 (Author) Commented:
Thanks. And I presume it may take many cycles, maybe 3-10, to actually perform one command.

mxjijo Commented:

>> to actually perform one command
Maybe you need to explain what you mean by "command".

As you might already know, a CPU works with a set of instructions called its "instruction set".
Depending on the architecture of the chip, instructions can be as simple as a move (mov) or as complex as MMX instructions.

The speed often advertised with a CPU denotes its internal clock speed.
Every instruction takes one or more of these clock ticks to execute.
Manufacturers also advertise IPS (instructions per second) or MIPS (millions of instructions per second).

Back to your question,
        On complex OSes like Windows or Unix, several programs run at the same time (time-shared), so it is virtually impossible to trace CPU cycle usage per process: the CPU keeps switching between processes.
However, to get an idea of other possibilities, you may want to read the MS docs for the functions used in the answer above.

Note that my answer above is a way of timing the amount of real time (aka "wallclock" time) that something takes. If you want to know what percentage of CPU time your program takes, you can use GetThreadTimes.

This function returns, as FILETIMEs, the amount of CPU time used in user mode and kernel mode. (Kernel time is time spent deep inside system calls).

A FILETIME is simply a 64-bit number that is in increments of 100 nanoseconds (100 billionths of a second, or 0.1 microsecond).

Please let me know if this is really what you wanted, and I'll go into some detail...
Marty543 (Author) Commented:
Thanks for your comments and the code.

By command, I meant instruction.

Now I know that it is too low level and impractical to get the number of instructions that are executing in a certain time frame, but getting the number of cycles and an accurate time difference between code statements is easy.
There is a way to actually record the instruction count, but it is extremely low level. Using performance monitoring counters, you can track the number of instructions executed (as well as a ton of other processor internals). I say it is "extremely" low level because the RDMSR and WRMSR instructions must be executed in kernel mode, so they require a driver to execute them. They are also *very* non-portable: every CPU model (even those from the same manufacturer!) has its own list of MSR register meanings.
Marty543 (Author) Commented:
How would that be done?

     It looks like you ARE working at a low level. I am not a CPU expert, but my understanding is: a given CPU always uses the same number of clock ticks for a given instruction. For example: say a mov instruction requires 10 clock ticks on a Pentium 4; then all mov instructions will use 10 ticks each, no matter what the arguments are. So if there are 10 mov instructions in your program, you can just add them up (10x10 = 100 ticks). This would give you the exact number of ticks needed to execute "your" code alone.

     What you would need is the whitepaper from the CPU manufacturer, which would give you the clock ticks required for each instruction. I don't know whether/where this information is available, but it may be worth considering.

hope that helps

okay.. I just found this link. It gives you the number of clocks required for every instruction
>> "a given CPU always uses same number of clock ticks for a given instruction"

That is not entirely correct. Older processors were quite predictable because they always executed instructions the same way. Newer processors use "out of order execution": several internal processor resources are shared among multiple execution units, and instructions are issued based on the availability of the required execution units. Instructions may also be issued "out of order" based on the availability of their operands. For example, assume the processor is executing instructions for the following code sequence:

mov eax,[_some_memory_operand]
mov ebx,1234
add ebx,[_some_other_memory_operand]
mov [_some_answer_variable],ebx
mov [_some_other_answer_variable],eax

This code:
- reads "some_memory_operand" into the eax register,
- loads 1234 into the ebx register,
- adds some_other_memory_operand to ebx,
- stores the sum in some_answer_variable,
- and stores eax in some_other_answer_variable.

Now let's assume that the memory for some_memory_operand is not in the cache (on the cpu core) and the processor must go all the way to the motherboard to read it. Let's also assume that the memory for some_other_memory_operand (the second instruction) IS in the cache. In this case, an older processor would *wait* until the first instruction pulled the data into the cache before continuing execution, even though it can immediately execute the second instruction. Newer processors use "out of order" execution, so it can actually get ahead and process the instructions after the first one even though it cannot complete the first one yet.

This causes many instructions to have widely varying timings even though they seem to be "simple" instructions. It all depends on the current execution context at the moment that an instruction is issued.

All processors since the Pentium Pro use out-of-order execution extensively. This makes them much faster and far less sensitive to the order of the program instructions. Because x86 processors have very few registers, this drastically improves performance: there are barely enough registers to properly "schedule" (put in the best order) the instructions.

Once all the data are in the CPU cache, instruction execution is a lot more predictable, but the time taken still varies, especially in complex loops where incorrect branch prediction can cause pipeline flushes.

Thank you for that posting, adg; that was quite a lot of information.
>> 10 mov instructions in your program, you can just add them up (10x10 = 100 ticks).

Again, not true anymore. Modern processors have two concepts which must be considered together when analyzing the instruction timing: latency and throughput.

Latency means "how many clock ticks until I get the answer"

Throughput means "how many clock ticks until I can issue another instruction."

For example, say a "mov reg,mem" (read a memory variable into a processor register) has a throughput of 1 and a latency of 8. Because the throughput is 1, you can issue one of those instructions on EVERY clock tick; however, the answer will not be available until 8 clock ticks later.

The reason for this is the "pipelined" nature of processor internals. Think of it like a car assembly line. If you stand at the end of the assembly line, you will see a complete car come out, say, every 30 seconds: the throughput of car construction is 30 seconds. However, if you followed a single car through the assembly line, you would find that it takes an hour to build: the latency is one hour. The inside of a processor works the same way: a new instruction can enter the pipeline very frequently, but the answer does not come out the other end until several cycles later.

Marty543 (Author) Commented:

Excellent example. Thanks for all of the information. mxjijo, thanks for the links.
