Use your O/S to Multi-Process
Introduction
This page is not about how to program multi-threading, multi-processing, or parallel processing, how to prevent deadlocks, how to work with mutexes (semaphores, etc.), race conditions, or good practices for working with threads. Nor is it about encouraging you to evolve your applications to take advantage of multiple processors (Intel wrote a great overall intro for game developers on why you should think about parallel processing). You can find that kind of documentation and white papers all over the web (my recommendations are to search MSDN, IBM, Intel, and even developers.net for what you are looking for).
If you are an advanced programmer, familiar with parallel processing, distributed (and grid) computing (Microsoft calls it HPC), or multi-threading, this page is not for you. Most likely you Googled a keyword such as “CreateThread” or “CreateProcess” and stumbled upon this page while searching for sample code, in which case my recommendation is to go to MSDN and use their sample code.
What this page is about is demonstrating how easy it is to take advantage of multiple processors using the Windows APIs CreateThread() and CreateProcess(). The sample code is not truly thread-safe: it calls printf() with no mutex or other synchronization in place, and making that robust is up to you (as a programmer).
The sample code is barely a useful application, but it was written as a proof of concept so that you can run it and see for yourself that, just by creating a thread, you automatically get parallel processing for free on a multi-processor system, thanks to the operating system (in this case Windows XP Pro, Vista, or Server), provided the O/S recognizes that you have multiple processors.
Sample Code
Before we go into more depth, we’ll start off with the sample code. It is unfortunate that such an intuitive and useful application as WordPress cannot display source code well. I’m now using the Highlight Source Pro plug-in, but it is still difficult to read (it’s way better than “Code Snippet” at least, because it does not overflow the view), thus you’ll find the link here to the actual code.
The code was written and tested on the following configurations:
- Visual Studio Express Edition 2008 Beta (VC9): I cannot afford Visual Studio (that’s why I’ve switched to Linux at home; at work, I’m spoiled with Visual Studio), but this Express Edition is superb! Almost every common feature I use at work in Visual Studio Professional (or Enterprise or Architect) is in the Express Edition. So what if I cannot have C# and managed C++ in the same solution, how often does that happen? I’m so surprised that Microsoft would give away such a high-quality IDE for free!
- Windows Vista (32-bit): My laptop came with 32-bit Vista (although most of the time I’m using Xubuntu). I could have fired up Windows XP Home to test it on my desktop (which normally runs Gentoo compiled for x86_64 on an AMD Opteron X2), but from what I recall, the Home edition does not support multiple processors (I may be wrong; as far as I know the limit applies to physical processor sockets, not to cores).
- AMD Turion 64 X2: It’s an HP Pavilion dv6000 series. I was kind of upset, since I ordered from HP and specifically requested 64-bit Vista just to try it out (I knew I’d be on Linux most of the time, so I didn’t care if the 64-bit version was incompatible with drivers, etc.). In any case, the main reason you’d want 64-bit Windows is to get past the roughly 3 GB of memory that 32-bit Windows can use, and I’d be insane to have a 4 GB laptop! I also could have tested it on my Opteron 64 X2, but well, I’m lazy…
You should be able to just create a Console Application in Visual Studio (it should work on VC7, VC8, and the VC9 Beta), copy and paste the entire code, and compile it. I don’t remember if VC7 (Visual Studio 2003) went with _tmain() or main(), but if it doesn’t support UNICODE mode, just edit the sections that use the UNICODE macros.
Speaking of UNICODE, all string (char*) based functions (such as printf()) are called through the Visual Studio generic-text macros, so the code should work transparently whether you compile with UNICODE defined or not. For more information, see Routine Mappings (I think this link is for VC8).
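To make the idea behind those macros concrete, here is a minimal portable sketch of how such a generic-text mapping can work. It does not use <tchar.h>; the names demo_tchar, DEMO_T, demo_tcslen, DEMO_UNICODE, and DemoLength are hypothetical stand-ins made up for illustration, loosely mirroring _TCHAR, _T(), and _tcsclen() from the sample code.

```cpp
#include <cstddef>
#include <string.h>   // strlen
#include <wchar.h>    // wcslen

// Hypothetical stand-ins for the generic-text mappings. Assumption: the real
// ones come from <tchar.h> and are switched by the compiler's Unicode
// defines, not by DEMO_UNICODE.
#if defined(DEMO_UNICODE)
typedef wchar_t demo_tchar;          // wide build: every character is a wchar_t
#define DEMO_T(text) L##text         // prefix the literal with L
#define demo_tcslen  wcslen          // route to the wide-character routine
#else
typedef char demo_tchar;             // narrow build: plain char
#define DEMO_T(text) text            // leave the literal untouched
#define demo_tcslen  strlen          // route to the narrow routine
#endif

// Written once against the generic names; compiles in either mode.
inline std::size_t DemoLength(const demo_tchar* text)
{
    return demo_tcslen(text);
}
```

A call such as DemoLength(DEMO_T("hello")) compiles unchanged whether or not DEMO_UNICODE is defined, which is the same trick the sample code relies on.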
//---------------------------------------------------------------------------------------------------------------------------------
// Author: Hideki A. Ikeda
// Purpose: Demonstrate and test the usage of CreateThread() and CreateProcess()
// Date Created: 06.14.2008
//---------------------------------------------------------------------------------------------------------------------------------
#include "stdafx.h"
#include <windows.h>    // for the Win32 API (harmless if stdafx.h already includes it)
#include <ctype.h>      // for usage of toupper()
#include <string>       // for std::string

//---------------------------------------------------------------------------------------------------------------------------------
const unsigned int MINIMUM_PROCESSORS_COUNT = 2;    // default of 2 process minimum

//---------------------------------------------------------------------------------------------------------------------------------
bool DoMasterProcess(_TCHAR * szApplicationName);
bool DoMasterThread(void);
bool DoTask(void);
DWORD WINAPI DoWorkerThread(LPVOID lpParam);

//---------------------------------------------------------------------------------------------------------------------------------
int _tmain(int argc, _TCHAR* argv[])
{
    // argv[1]:
    //  * P = Multi Processor mode
    //  ** S = SubProcess created from main Process - this is because CreateProcess() runs another
    //         instance(s) of this application (as another process)
    //  * T = Multi Threaded mode
    //  * O = OpenMP mode (let the interface parallelize the methods)
    if (argc > 1)
    {
        LARGE_INTEGER startQPC;
        LARGE_INTEGER frequency;
        QueryPerformanceCounter(&startQPC);
        QueryPerformanceFrequency(&frequency);

        // We only care about the first character of argv[1] (also force it upper case so we only do a single compare per test)
        _TCHAR parameter1 = (_TCHAR) _totupper(argv[1][0]);
        if (parameter1 == _T('P'))
        {
            if (DoMasterProcess(argv[0]) == false)
            {
                return(-1);
            }
        }
        else if (parameter1 == _T('S'))
        {
            if (DoTask() == false)
            {
                return(-2);
            }
        }
        else if (parameter1 == _T('T'))
        {
            if (DoMasterThread() == false)
            {
                return(-3);
            }
        }
        else if (parameter1 == _T('O'))
        {
            // Cannot utilize OpenMP on Express Edition. For more details, see https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=98939&wa=wsignin1.0
        }
        else
        {
        }

        LARGE_INTEGER endQPC;
        QueryPerformanceCounter(&endQPC);
        const double totalTimeInSeconds = (endQPC.QuadPart - startQPC.QuadPart) / (double) frequency.QuadPart;
        _tprintf(_T("\n\nTotal time was %f seconds\n\n"), totalTimeInSeconds);
    }
    return(0);
}

//---------------------------------------------------------------------------------------------------------------------------------
bool DoMasterProcess(_TCHAR * szApplicationName)
{
    bool processCompleted = false;
    if (szApplicationName && _tcsclen(szApplicationName))
    {
        SYSTEM_INFO systemInfo;
        GetSystemInfo(&systemInfo);    // query how many processors this machine has

        // NOTE: CreateProcess() will not spawn a process if the application name is passed incorrectly (specifically UNICODE versus ASCII).
        //       Also, because argv[0] may contain spaces (a full path name with subfolders that have spaces), it needs to
        //       be wrapped in quotes or else any string separated by a space would be treated as an argv[n] command line argument.
        //       Secondly, because UNICODE mode actually alters the command line arg, you cannot pass const data or else it will
        //       cause access violations.
        const unsigned int sizeofBuffer = 1024;
        _TCHAR appName[sizeofBuffer];
        appName[0] = 0;
        _TCHAR commandArgs[sizeofBuffer];
        commandArgs[0] = 0;
        _stprintf_s(appName, sizeofBuffer, _T("%s"), szApplicationName);            // no need to wrap it with double-quotes?
        _stprintf_s(commandArgs, sizeofBuffer, _T("\"%s\" S"), szApplicationName);  // make sure argv[0] is part of the command line (must be wrapped with double-quotes)
#if defined(_DEBUG)
        _tprintf(_T("Start: MultiProcess mode - Creating sub-processes of '%s'\n"), szApplicationName);
#endif
        // create one sub-process per processor (but never fewer than MINIMUM_PROCESSORS_COUNT)
        unsigned int processCount = MINIMUM_PROCESSORS_COUNT;
        if (processCount < systemInfo.dwNumberOfProcessors)
        {
            processCount = systemInfo.dwNumberOfProcessors;
        }
        _tprintf(_T("Creating %u Sub-Processes\n"), processCount);

        STARTUPINFO * subProcessStartupInfo = new STARTUPINFO[processCount];
        PROCESS_INFORMATION * subProcessInfo = new PROCESS_INFORMATION[processCount];
        unsigned int createdCount = 0;    // how many sub-processes were actually started
        for (unsigned int currentProcessIndex = 0; currentProcessIndex < processCount; ++currentProcessIndex)
        {
            ZeroMemory(&subProcessStartupInfo[currentProcessIndex], sizeof(STARTUPINFO));
            subProcessStartupInfo[currentProcessIndex].cb = sizeof(STARTUPINFO);
            ZeroMemory(&subProcessInfo[currentProcessIndex], sizeof(PROCESS_INFORMATION));

            // For more details, see http://msdn.microsoft.com/en-us/library/ms682425(VS.85).aspx
            BOOL createSuccess = CreateProcess(
                appName,                // application name - this is without the double-quotes wrapped
                commandArgs,            // command line argument with argv[0] being the actual module (program) name.
                NULL,                   // ProcessAttributes
                NULL,                   // ThreadAttributes
                FALSE,                  // InheritHandles
                NORMAL_PRIORITY_CLASS,  // CreationFlags
                NULL,                   // Environment
                NULL,                   // CurrentDirectory - set this to NULL to use the same directory as the calling process!
                &subProcessStartupInfo[currentProcessIndex],
                &subProcessInfo[currentProcessIndex]);
            if (createSuccess == FALSE)
            {
                DWORD lastError = GetLastError();
                _tprintf(_T("Unable to create SubProcess #%u with error #%lu (0X%08lX)\n"), currentProcessIndex, lastError, lastError);
                break;    // stop creating, but still wait for and clean up the ones already started
            }
            ++createdCount;
#if defined(_DEBUG)
            _tprintf(_T("\tSubProcess %u created successfully\n"), currentProcessIndex);
#endif
        }

        // wait until all created sub-processes complete
        for (unsigned int currentProcessIndex = 0; currentProcessIndex < createdCount; ++currentProcessIndex)
        {
            WaitForSingleObject(subProcessInfo[currentProcessIndex].hProcess, INFINITE);
#if defined(_DEBUG)
            _tprintf(_T("\t\tSubprocess %u completed\n"), currentProcessIndex);
#endif
        }

        // Close process and thread handles.
        for (unsigned int currentProcessIndex = 0; currentProcessIndex < createdCount; ++currentProcessIndex)
        {
            CloseHandle(subProcessInfo[currentProcessIndex].hProcess);
            CloseHandle(subProcessInfo[currentProcessIndex].hThread);
        }
#if defined(_DEBUG)
        _tprintf(_T("Done: MultiProcess mode\n"));
#endif
        delete [] subProcessStartupInfo;
        delete [] subProcessInfo;
        processCompleted = (createdCount == processCount);
    }
    return(processCompleted);
}

//---------------------------------------------------------------------------------------------------------------------------------
bool DoMasterThread(void)
{
    SYSTEM_INFO systemInfo;
    GetSystemInfo(&systemInfo);    // query how many processors this machine has
    unsigned int processCount = MINIMUM_PROCESSORS_COUNT;
    if (processCount < systemInfo.dwNumberOfProcessors)
    {
        processCount = systemInfo.dwNumberOfProcessors;
    }
    _tprintf(_T("Creating %u Worker Threads\n"), processCount);

    HANDLE * hThreadArray = new HANDLE[processCount];   // handles returned by CreateThread()
    DWORD * dwThreadIdArray = new DWORD[processCount];  // ThreadID is optional but useful for debugging (but not really used in this demo code)
    unsigned int createdCount = 0;                       // how many threads were actually started

    // Create worker threads.
    for (unsigned int currentThreadIndex = 0; currentThreadIndex < processCount; ++currentThreadIndex)
    {
        // Create the thread to begin execution on its own - For more details, see http://msdn.microsoft.com/en-us/library/ms682453(VS.85).aspx
        hThreadArray[currentThreadIndex] = CreateThread(
            NULL,                                   // lpThreadAttributes - Use default security attributes
            0,                                      // dwStackSize - Use default stack size
            DoWorkerThread,                         // lpStartAddress - My thread function name
            NULL,                                   // lpParameter - My argument to thread function
            0,                                      // dwCreationFlags - Use default creation flags
            &dwThreadIdArray[currentThreadIndex]);  // lpThreadId - CreateThread() returns the thread ID and is useful for debugging
        if (hThreadArray[currentThreadIndex] == NULL)
        {
            // No need to call ExitProcess() (although you'd normally want to) because the main function will bail out upon failure
            DWORD lastError = GetLastError();
            _tprintf(_T("Unable to create Thread #%u with error #%lu (0X%08lX)\n"), currentThreadIndex, lastError, lastError);
            break;    // stop creating, but still wait for and clean up the ones already started
        }
        ++createdCount;
    } // for()

    // Wait until all threads have terminated (WaitForMultipleObjects() accepts at most MAXIMUM_WAIT_OBJECTS handles).
    if (createdCount > 0)
    {
        WaitForMultipleObjects(createdCount, hThreadArray, TRUE, INFINITE);    // bWaitAll = TRUE
    }

    // Close thread handles
    for (unsigned int currentThreadIndex = 0; currentThreadIndex < createdCount; ++currentThreadIndex)
    {
        CloseHandle(hThreadArray[currentThreadIndex]);
    }

    // clean up
    delete [] dwThreadIdArray;
    delete [] hThreadArray;
    return(createdCount == processCount);
}

//---------------------------------------------------------------------------------------------------------------------------------
// Normally, the lpParam is the data pointer to lpParameter passed via CreateThread()
DWORD WINAPI DoWorkerThread(LPVOID /*lpParam*/)
{
    if (DoTask() == false)
    {
        return(1);
    }
    return(0);
}

//---------------------------------------------------------------------------------------------------------------------------------
// The task is simple. Just wait 1 second and leave.
// The idea is that if the system is truly parallel, as long as we spawn an equal number of processes or threads for the number of
// processors on the system, they should all return at once.
// For example, if you have a dual-core and you spawn 2 threads, it should execute these two threads simultaneously, thus
// the time it should take to execute both threads would be 1 second (plus small overhead) because they should run in parallel.
bool DoTask(void)
{
    bool processCompleted = false;
    // TODO: Set affinity per process so that we're forcing (and guaranteeing) that each process is dedicated to its own processor
    // SetThreadAffinityMask(GetCurrentThread(), bitProcessorAffinityMaskFlag);
    Sleep(1000);
    processCompleted = true;    // the wait finished, so report success to the caller
    return(processCompleted);
}
About the code
So the demo basically has two modes. Use the command line parameter “P” to create processes and “T” to create threads. I was going to demonstrate OpenMP as well, but it is not supported in the Express Edition (I think that in the retail versions OpenMP is only supported in Enterprise and above).
In any case, the code assumes you have at least two processors. In both CreateThread() and CreateProcess() mode it creates a minimum of two workers, even on a single-processor system. On a quad-processor system, it will create 4 threads or processes.
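The counting rule described above can be sketched in portable C++. This is not the Windows code from the listing: std::thread::hardware_concurrency() is swapped in for GetSystemInfo(), and the names WorkerCount, DefaultWorkerCount, and kMinimumWorkers are hypothetical, chosen only for this illustration.

```cpp
#include <algorithm>
#include <thread>

// Same rule as the sample: never fewer than `minimumWorkers`, otherwise one
// worker per detected processor.
inline unsigned int WorkerCount(unsigned int detectedProcessors, unsigned int minimumWorkers)
{
    return std::max(detectedProcessors, minimumWorkers);
}

// Portable stand-in for GetSystemInfo(): hardware_concurrency() may return 0
// when the count cannot be determined, in which case the minimum applies.
inline unsigned int DefaultWorkerCount()
{
    const unsigned int kMinimumWorkers = 2;   // mirrors MINIMUM_PROCESSORS_COUNT
    return WorkerCount(std::thread::hardware_concurrency(), kMinimumWorkers);
}
```

So WorkerCount(1, 2) yields 2 workers on a single-processor machine and WorkerCount(4, 2) yields 4 on a quad-processor one.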
The task is simple: just wait 1 second (1000 milliseconds) and bail out. Before the processes or threads are created, we record the starting time (QueryPerformanceCounter()) and then calculate the elapsed time once all threads (or processes) have completed.
If all the tasks in fact ran in parallel, the total run time of the application should be close to 1 second plus the small overhead of creating and cleaning up. If the total execution time is way over 1 second (e.g. 2 seconds), then most likely the threads got serialized (meaning each thread ran one after the next in order of creation) rather than parallelized. One caveat: Sleep() gives up the processor while it waits, so sleeping workers can overlap even on a single processor; a CPU-bound task would be a stricter test of true parallelism.
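For readers without a Windows machine at hand, the measurement can be sketched portably. This is a stand-in under stated assumptions, not the code from the listing: std::thread replaces CreateThread(), std::chrono::steady_clock replaces QueryPerformanceCounter(), and the names TimeWorkers and SleepTask are hypothetical.

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Runs `workerCount` copies of `task` on separate threads and returns the
// wall-clock time in seconds from just before the first thread is created
// until the last one has finished.
inline double TimeWorkers(unsigned int workerCount, const std::function<void()>& task)
{
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    workers.reserve(workerCount);
    for (unsigned int i = 0; i < workerCount; ++i)
    {
        workers.emplace_back(task);     // each thread starts running immediately
    }
    for (std::thread& worker : workers)
    {
        worker.join();                  // plays the role of WaitForMultipleObjects()
    }

    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// The demo task: do nothing for the given number of milliseconds.
inline void SleepTask(int milliseconds)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
}
```

Calling TimeWorkers(4, [] { SleepTask(1000); }) should report a little over 1 second when the workers overlap, and roughly 4 seconds if they were somehow run one after another. Passing a CPU-bound lambda instead of SleepTask makes the same harness measure real computation.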
The screen-shot below is an example from my dual-processor machine (AMD Turion 64 X2), first with parameter “P” (CreateProcess()) and then with “T” (CreateThread()). I then experimented by increasing MINIMUM_PROCESSORS_COUNT to 8, just to demonstrate that the number of processors does not restrict how many processes or threads you can create. It also suggests that your application will scale as you get more processors on your platform.
Of course, the more scientific and academic approach would be to collect data with two threads on a single-processor machine (with an almost identical setup and a similar clock frequency), then run the same thing on a dual-processor machine and compare. In theory you should see double the performance, or half the processing time. (Note that I am exaggerating when I say it will double; see Amdahl’s law under Parallel Computing.)

In any case, the above screen-shot shows that the overhead for the CreateProcess() method is way higher (0.043297 seconds = 43 milliseconds) than for CreateThread() (0.000405 seconds = 405 microseconds). That’s a significant difference! But that doesn’t mean you should dismiss CreateProcess() yet, nor is that the point here. The point is that there are many ways to process tasks in parallel. Even with 43 milliseconds of overhead, the tasks were still processed in parallel (or else the time on top of the 1-second task would have been 1000+ milliseconds rather than 43).
One could object that “perhaps Windows Vista is just doing a great job with time-slicing and it’s not really doing multi-processing” (see the time slice comment in Multitasking (Windows)), and that is a fair point. According to that link, Windows time-slices at 20 milliseconds per thread per process (I’ve read in a Sysinternals article that Vista has “fairer” scheduling than XP and older versions of Windows). That could explain the CreateProcess() result: 2 processes, each owning 1 thread, each getting a 20-millisecond time slice (40 milliseconds in total, plus 3 milliseconds of overhead). To be honest, I cannot prove it… What bothers me more is that when I actually create 8 processes and watch Task Manager, I do see 8 processes flash by, but most of them get assigned to CPU#0 and usually only one gets assigned to CPU#1. Then again, I have not investigated much, because I’d use CreateProcess() for a different purpose and I’d probably force the thread affinity (see Multiple Processors (Windows) for some info on setting affinities). My guess is that because my tasks are so small, Windows was able to push all or most of them onto CPU#0. I expect that if your tasks are more heavy-duty, Windows will do a fair job of assigning each task to a CPU.
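If you do decide to force affinity, the mask itself is simple bit arithmetic. The sketch below only computes the mask and is portable; the function names AffinityMaskForProcessor and AffinityMaskForWorker are hypothetical, and the round-robin policy is just one assumption about how you might spread workers out.

```cpp
#include <cstdint>

// Bit i of an affinity mask stands for logical processor i.
// Assumption: processorIndex is less than 64, the width of the mask used here.
inline std::uint64_t AffinityMaskForProcessor(unsigned int processorIndex)
{
    return std::uint64_t(1) << processorIndex;
}

// Round-robin assignment: worker k goes to processor k modulo the processor
// count. Assumption: processorCount is at least 1.
inline std::uint64_t AffinityMaskForWorker(unsigned int workerIndex, unsigned int processorCount)
{
    return AffinityMaskForProcessor(workerIndex % processorCount);
}

// On Windows the result would be handed over along the lines of:
//   SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR) mask);
```

On a dual-processor machine, workers 0, 2, 4, ... would get mask 1 (CPU#0) and workers 1, 3, 5, ... would get mask 2 (CPU#1).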
In any case, that’s the point of this page: make the O/S do the work of processing jobs in parallel. Trust it, do your part, and let the smart guys at Microsoft (or, for Linux SMP, the kernel developers) do theirs and deliver what they advertise on the box. (From what I recall, only the “Business”, “Ultimate”, and “Enterprise” editions of Vista support dual processors, although my “Vista Home Premium” apparently handles my dual-core CPU, since Task Manager shows both cores and I can set affinities; the edition limit seems to count physical processors, not cores.)
With that said, just as with SMP (symmetric multiprocessing) in general, you should be able to trust Windows to distribute threads across multiple processors, and this should be transparent to you. You should just be able to create threads and trust that it will work. From several (work-related) experiences, I’ve verified for myself that even XP Pro does a great job of distributing threads across multiple processors. Again, the point is that you create the threads, let them run, and trust the O/S to distribute the tasks to the appropriate (available) processors. You can then brag that (thanks to Windows) your application supports multiple processors and scales to higher performance (let’s hope) as you run it on systems with more processors.
Finally, when would it be a good idea to create sub-processes rather than threads? If you’ve studied the CreateProcess() API, you’ll have noticed immediately that its parameters describe the execution of a “module” (as Microsoft calls it), with data passed in via command line arguments. Folks who are familiar with *NIX-style operating systems are much more comfortable with this method because they are used to writing small applications (sometimes wrapped in a GUI front-end). No offense to Windows application writers, but Windows .EXEs are fat! Remember the days of .COM files? .COM files were required to fit into a small footprint. A smaller footprint means fewer chances of (instruction) cache misses. I believe that is the formula for parallel processing: keep the jobs small and quick.
Another advantage of doing parallelization with processes rather than threads is that it’s easier to implement distributed computing. Each run of the .EXE is a small job, specific to its own purpose and nothing more. Another approach I’ve seen is for the .EXE to be a standalone tool that other tools can invoke to run a job. Have you ever written a simple command-line tool, then written another program and decided it would be easier to just call the tool from your C code using the C run-time library’s exec() family or the older Win32 WinExec()? Microsoft recommends using CreateProcess() rather than WinExec().
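When you launch such a tool yourself, the fiddly part is usually the command line, because a program path containing spaces has to be quoted. Here is a minimal sketch of that one step; BuildCommandLine is a hypothetical helper, and it assumes the path contains no embedded double quotes (the full Windows quoting rules are more involved than this).

```cpp
#include <string>

// Builds a command line whose first token is the quoted program path, the
// same shape the sample produces with the format string "\"%s\" S".
// Assumption: programPath contains no embedded double quotes.
inline std::string BuildCommandLine(const std::string& programPath, const std::string& arguments)
{
    std::string commandLine = "\"" + programPath + "\"";
    if (!arguments.empty())
    {
        commandLine += " " + arguments;   // arguments are appended as given
    }
    return commandLine;
}
```

As the note in the sample code points out, the command line handed to CreateProcess() must not be const data, so you would copy the resulting string into a writable character array before making the call.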
Oh, and keep in mind that nothing prevents you from running multiple threads inside each process. The art of multi-threaded programming takes time to adjust to and master (especially debugging to catch and determine the causes of deadlocks), but once you get the hang of it, it becomes easier; you just need to keep practicing.