bergman on 8 Dec 2010 08:31:29 -0800


Re: [PLUG] recompiling a kernel for performance

The pithy ruminations from Mag Gam <> on "Re: [PLUG] recompiling a kernel for performance" were:

=> Thanks everyone for your responses.
=> Yes, I did look for a turbo button and apparently one does exist!  HP
=> has a secret BIOS which lets you disable powersaving modes and
=> apparently it makes things faster. :-)  

That's a good button to press. :)

I don't recall the hardware specifics, but I've found for scientific computing that hyperthreading features often result in lower performance. For our apps, on last-generation Nehalem architecture, the difference was about an 8% gain for disabling that feature.

=> Yes, I meant 2.6.36
=> The programs are mostly CPU intensive. However , I/O is always a
=> factor so I am planning to implement a distributed files system  such
=> as Hadoop.  Regarding jumboframes, do I have to make the switch on the  

Hmm.... I don't know of HPC sites using hadoop...not that it isn't possible...just that other solutions (gpfs, isilon, [g]lustre) seem to be much more widely used.

=> client side or switch side ?  

Both. Be careful...there may be problems if all devices on a switched (not routed) network don't use the same frame size.
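A minimal sketch of the consistency check implied above, assuming you have already collected the MTU of every host interface and switch port on the segment. The device names and values are made up for illustration, and `find_mtu_mismatches` is a hypothetical helper, not part of any tool.

```python
from collections import Counter

def find_mtu_mismatches(mtus):
    """Return (most common MTU, {device: mtu} for devices that differ from it)."""
    if not mtus:
        return None, {}
    expected, _count = Counter(mtus.values()).most_common(1)[0]
    odd = {dev: mtu for dev, mtu in mtus.items() if mtu != expected}
    return expected, odd

# Hypothetical inventory of one switched (not routed) network segment.
segment = {
    "node01": 9000,
    "node02": 9000,
    "nfs01": 9000,
    "switch-port-7": 1500,  # the one device still at the default frame size
}

expected, odd = find_mtu_mismatches(segment)
print("expected MTU:", expected)
print("mismatched devices:", odd)
```

Anything listed as mismatched is a candidate for the kind of problem described above and is worth fixing before enabling jumbo frames cluster-wide.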

=> Austin,
=> READ-COMPUTE-WRITE is absolute correct for my case however the CPU is  

Um, isn't that the generic description of any program that loads from persistent storage and then writes any results? :)

For reference, our lab has a mix of jobs--typical runtimes range from about 15 minutes to 8~10 days. Typically, they will read in a bunch of data (10s to 100s of MB), then compute, write some intermediate results, compute on the intermediate results, and produce a final result. The IO time is so many orders of magnitude smaller than the compute time that I haven't put much effort into optimizing IO...the return is too small. In some cases, it has been valuable to consult with the developer about the IO patterns (for example, one program had an algorithm that basically said "read a file line-by-line to the end, append a line, close the file, compute more results, repeat"...that program was observed to get slower & slower with each iteration, in part due to the IO).
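To illustrate why that "read to the end, append, close, repeat" pattern slows down, here is a small sketch (not the actual program, which isn't shown in this thread). It counts how many lines get re-read over the life of the job; the function names are hypothetical.

```python
import os
import tempfile

def append_by_rereading(path, line):
    """The slow pattern: walk the whole file to reach the end, then append."""
    lines_read = 0
    with open(path, "r") as f:
        for _ in f:
            lines_read += 1
    with open(path, "a") as f:
        f.write(line + "\n")
    return lines_read

def total_lines_reread(n_iterations):
    """Run the slow pattern n times and return the total number of lines read."""
    fd, path = tempfile.mkstemp(text=True)
    os.close(fd)
    try:
        total = 0
        for i in range(n_iterations):
            total += append_by_rereading(path, f"result {i}")
        return total
    finally:
        os.remove(path)

def append_with_open_handle(path, lines):
    """One alternative: open once in append mode and never re-read."""
    with open(path, "a") as f:
        for line in lines:
            f.write(line + "\n")
            f.flush()  # keep intermediate results on disk as they are produced

n = 100
print(n, "appends re-read", total_lines_reread(n), "lines in total")
```

The re-read count grows as 0 + 1 + ... + (n-1), i.e. quadratically in the number of iterations, while the open-handle version does a constant amount of IO per result.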

=> always at 100%. I am assuming I am still CPU bound because, I read
=> about 90% of data and compute and then generate a result.  

I'm not clear on what you mean by reading 90% of the data...

=> Eric,
=> Regarding, 'Do your cluster tools allow you to measure the performance
=> of specific nodes? ', this is a very good question. Yes, I am using
=> lapack provided by Intel but the problem is I am not sure what to look  

Um, lapack is a set of computational libraries, not performance monitoring tools.

I'm guessing that Eric was thinking of tools like top, iostat, ganglia, vmstat, sar, mpstat.
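As a rough sketch of what those tools report, the aggregate `cpu` line in `/proc/stat` on Linux carries cumulative tick counters (user, nice, system, idle, iowait, ...). Taking two samples and comparing them shows whether a node is spending its time computing or waiting on IO. The sample lines below are made up; on a real node you would read the first line of `/proc/stat` twice, a few seconds apart.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first eight counters).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(line):
    """Turn 'cpu  u n s i w ...' into a dict of cumulative tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))

def shares_between(before, after):
    """Fraction of elapsed CPU time spent in each state between two samples."""
    delta = {k: after[k] - before[k] for k in before}
    total = sum(delta.values())
    if total == 0:
        return {k: 0.0 for k in delta}
    return {k: v / total for k, v in delta.items()}

# Two hypothetical samples, for illustration only.
t0 = parse_cpu_line("cpu 1000 0 200 8000 800 0 0 0 0 0")
t1 = parse_cpu_line("cpu 1900 0 250 8020 830 0 0 0 0 0")

shares = shares_between(t0, t1)
for state in ("user", "system", "idle", "iowait"):
    print(f"{state:>7}: {shares[state]:.1%}")
```

A high user share with negligible iowait points at a CPU-bound job; a large iowait share suggests looking at storage before anything else.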

=> for. Ideally, I want something like this... if I have a  processor, P
=> which can do X floating point operations in T,time...I run the program
=> and it should be very close to the manufacture specs. Then I know the  

Ha! What you're saying is that you want your code to perform as well as the benchmarks.... Before looking at recompiling the kernel, I'd strongly suggest running benchmarks (ideally the same ones used by the manufacturer to produce their FLOPS specs). You may find that the machine is very close to its specs, and that the observed difference is within the application.
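For the "X floating point operations in time T" comparison, the theoretical peak is just arithmetic. A sketch, with every number below an assumption for illustration (a dual-socket, quad-core node at 2.66 GHz, 4 double-precision FLOPs per cycle per core, and a made-up benchmark result); substitute the figures from your own hardware's spec sheet and your own benchmark run.

```python
def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: total cores x clock (GHz) x FLOPs per cycle per core."""
    return sockets * cores_per_socket * ghz * flops_per_cycle

def efficiency(measured_gflops, peak):
    """Fraction of theoretical peak actually achieved by a benchmark or application."""
    return measured_gflops / peak

# Assumed hardware figures, not taken from this thread.
peak = peak_gflops(sockets=2, cores_per_socket=4, ghz=2.66, flops_per_cycle=4)

# Hypothetical measured result from a benchmark run on the same node.
measured = 72.0

print(f"theoretical peak: {peak:.2f} GFLOPS")
print(f"benchmark efficiency: {efficiency(measured, peak):.1%}")
```

If the vendor's benchmark gets close to peak on your node and your application does not, the gap is in the application rather than the kernel.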

In my experience, the greatest point of inefficiency (and place for improvement) is in the application code, not in kernel tuning. I've found a few notable exceptions to this (network tuning, sometimes IO cache and scheduler algorithm tuning, NFS), but rarely for CPU-bound apps.

I don't recall the details of your environment -- however, if you've got multiple apps per-node, a mix of application types, interactive sessions on the compute nodes, etc. or anything that results in a variety of simultaneous workloads, then the performance for each app will rarely meet the theoretical processor limit.

You mentioned octave, R, and Matlab. One thing to be aware of is that Matlab (depending on the version, whether you're running an MCC app or .m files, etc.) doesn't play nicely with others on a multiprocessor machine....Matlab (up to R2009B, AFAIK) cannot be restricted to a limited number of CPUs greater than one...either it's single threaded or it will use all processors. When people submit jobs to the scheduler (SGE) without indicating that the jobs are multi-threaded, the scheduler packs the node as if each job needed one core...when the Matlab sections start executing, and there are N-1 jobs on an N CPU-core machine, and each job tries to use all N cores, everything gets very slow.
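The arithmetic behind that slowdown, as a small sketch. The 8-core node is an assumed example, and `oversubscription` is a hypothetical helper.

```python
def oversubscription(n_cores, jobs, threads_per_job):
    """Runnable threads per physical core when every job is busy at once."""
    return jobs * threads_per_job / n_cores

n_cores = 8  # assumed node size, for illustration

# What the scheduler thinks it placed: N-1 single-threaded jobs.
planned = oversubscription(n_cores, jobs=n_cores - 1, threads_per_job=1)

# What actually runs once each job spawns one thread per core.
actual = oversubscription(n_cores, jobs=n_cores - 1, threads_per_job=n_cores)

print(f"planned load: {planned:.2f} threads per core")
print(f"actual load:  {actual:.2f} threads per core")
```

With 7 jobs on 8 cores the node goes from under one thread per core to 7 threads per core, which is why declaring multi-threaded jobs to the scheduler matters.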

=> processor is at optimal state. I am not sure if anything like that
=> exists....please guide me if one does.
=> p.s 50 node cluster is our test cluster. We have a much larger one for
=> production :-)  

What are you using to manage the clusters (ROCKS, Platform HPC, etc)? Look for monitoring tools (ganglia, ntop, etc.) that are bundled with the management suite.

Mark 'planning a 500-core cluster & remembering when that was large' Bergman

=> On Mon, Dec 6, 2010 at 11:22 AM, Austin Murphy <> wrote:  
=> > Hi Mag,
=> >
=> > On Mon, Dec 6, 2010 at 8:26 AM, Mag Gam <> wrote:
=> > ...  
=> >> simulation may consist of a Octave, Python, R, and MATLAB process
=> >> which reads data and generates data. Each process can take 60 mins to
=> >> 70 hours. I am sure there are other tuning we can do such as -- tune
=> >> I/O subsystem, tune network, etc...  
=> > ...  
=> >> Assuming all the 'low bearing fruit' have been picked would
=> >> recompiling with the latest 2.3.36 kernel help in computing speed?  
=> > ...  
=> >> Also, are there any settings in the kernel I can set to enhance
=> >> performance -- According to redhat you should stick with their build  
=> >
=> > I don't think you are going to find a hidden "turbo button" in the
=> > kernel tunable options.
=> >
=> > For the most part, the kernel is already configured for maximum speed
=> > across a wide range of possible workloads without unreasonable
=> > side-effects.  The tunable options give you a chance to make some
=> > workloads faster at the expense of making other workloads slower.
=> >
=> > If you have a 50 node environment, I'd guess that the biggest gains
=> > will be seen in improving the performance of your shared storage.
=> > Ethernet jumbo frames or TCP offload might help if you have the
=> > hardware support.  Mounting with "noatime" can cut down on a lot of
=> > unnecessary writes.
=> >
=> > You might also want to oversubscribe your CPUs.  For example, if your
=> > processes go like this: READ--COMPUTE--WRITE, there is probably a lot
=> > of free CPU time available while reading and writing to run more
=> > threads or jobs.  An 8 core server with sufficient RAM might be able
=> > to run 12 or 16 jobs in about the same amount of time as 8 jobs.
=> >
=> > Austin
=> > ___________________________________________________________________________
=> > Philadelphia Linux Users Group     --
=> > Announcements -
=> > General Discussion  --
=> >  

Mark Bergman    Biker, Rock Climber, Unix mechanic, IATSE #1 Stagehand
'94 Yamaha GTS1000A^2

I want a newsgroup with a infinite S/N ratio! Now taking CFV on:
15+ So Far--Want to join? Check out: 