Intel Flagship Core i7-6950X Broadwell-E To Offer 10-Cores, 20-Threads, 25MB L3 (hothardware.com) 167
MojoKid writes: Intel has made a habit of launching enthusiast versions of previous-generation processors after it releases a new architecture. As was the case with Intel's Haswell architecture, high-end Broadwell-E variants are expected, and it looks like Intel is readying a doozy. Recently revealed details show four new processors under the HEDT (High-End Desktop) banner for Broadwell, one more SKU than Haswell-E brought to the table. The most intriguing of the new chips is the Core i7-6950X, a monster 10-core CPU with Hyper-Threading support. That gives the Core i7-6950X 20 threads to play with, along with a whopping 25MB of L3 cache. The caveat is the CPU's clock speed: it will run at just 3.0GHz (base), so in applications that aren't properly tuned to take advantage of large core and thread counts, it could potentially trail the Core i7-6700K, a quad-core Skylake processor clocked at 3.4GHz (base) to 4GHz (Turbo).
Software needs to catch up (Score:2)
Mainstream programming languages are still sequential by default, and the likes of OpenCL are too hard to learn for simple tasks. UI code is still single-threaded in most systems, and through programmer laziness it drags most computation onto that thread as well. It's time for languages that are parallel by default and where the ability to parallelize a loop is verifiable at compile time. Yes, I know FORTRAN is much closer to that than C/Java, but that's due to being primitive to a degree that will not fly in
Re: (Score:2)
Julia and Rust have some intriguing parallelisation mechanisms.
What I would like to write is code that has a dependency graph (or the compiler figures out the dependencies and parallelises by itself).
In the meantime I simply write code for 1 processor, and run that on many data sets in parallel (using make or doit [pydoit.org]).
Re: (Score:2)
Julia and Rust have some intriguing parallelisation mechanisms. [...] I simply write code for 1 processor, and run that on many data sets in parallel
I haven't found Julia's parallelism very efficient, but maybe it's just my lack of coding skills. Then again my work [algoristo.com] is rather parallel by nature (independent pixels), so I simply run several processes using shell scripts.
For example, it would be nice if something like map() were always parallelized as it kind of assumes independent data points, but there are still other considerations like memory management. Julia's pmap() seems to have too much overhead to be of any help, especially when the separate p
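The pmap() overhead point carries over to Java's parallel streams as well; as a hedged sketch (Java rather than Julia, with hypothetical method names), parallelizing a map only pays when the per-element work dwarfs the task-splitting and result-merging cost:

```java
import java.util.stream.IntStream;

public class ParallelMapOverhead {
    // The same map-reduce, run sequentially or in parallel. For work this
    // cheap the parallel version usually loses: splitting the range into
    // tasks and merging results costs more than the arithmetic saves.
    static long sum(int n, boolean parallel) {
        IntStream s = IntStream.range(0, n);
        if (parallel) s = s.parallel();
        return s.mapToLong(i -> i + 1).sum();
    }

    public static void main(String[] args) {
        // The answer is identical either way; only wall-clock time differs.
        System.out.println(sum(1_000, false)); // 500500
        System.out.println(sum(1_000, true));  // 500500
    }
}
```

Timing the two variants on both tiny and compute-heavy element work is the quickest way to find where the crossover sits on a given machine.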
Re: (Score:2)
Java makes it pretty easy to write multi-threaded code.
Re: (Score:2)
Multithreading requires every instance of concurrent execution to be micromanaged by the programmer, which leads to a lot of code that is never parallelized in practice. Potential concurrency should be the default case for, say, all for loops, with sequential execution an explicit choice the programmer is aware of, coupled with strong compile-time checking that can tell safe code from unsafe.
Re: (Score:2)
Here's an example from Oracle:

    double average = roster
        .parallelStream()
        .filter(p -> p.getGender() == Person.Sex.MALE)
        .mapToInt(Person::getAge)
        .average()
        .getAsDouble();
Lambdas and extensions to the Collections framework have made parallel loops simple.
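For completeness, here is a self-contained, runnable version of that Oracle snippet. The Person class is a stand-in reconstructed for illustration, not Oracle's actual tutorial class:

```java
import java.util.List;

public class RosterAverage {
    // Hypothetical stand-in for the Person class assumed by the snippet.
    static class Person {
        enum Sex { MALE, FEMALE }
        private final String name;
        private final Sex gender;
        private final int age;
        Person(String name, Sex gender, int age) {
            this.name = name; this.gender = gender; this.age = age;
        }
        Sex getGender() { return gender; }
        int getAge() { return age; }
    }

    static double averageMaleAge(List<Person> roster) {
        // Identical pipeline to the quoted example: the only "parallel"
        // part is swapping stream() for parallelStream().
        return roster.parallelStream()
                .filter(p -> p.getGender() == Person.Sex.MALE)
                .mapToInt(Person::getAge)
                .average()
                .getAsDouble();
    }

    public static void main(String[] args) {
        List<Person> roster = List.of(
                new Person("Ada", Person.Sex.FEMALE, 36),
                new Person("Bob", Person.Sex.MALE, 20),
                new Person("Carl", Person.Sex.MALE, 40));
        System.out.println(averageMaleAge(roster)); // 30.0
    }
}
```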
You can't have the compiler parallelize loops if the methods you call can be overridden. Forcing every method called to be final, just so you can optimize some loop, is a little daft. It's also pointless (read: slower) parallelizing a loop that only runs
Re: (Score:2)
Nothing has to be made final, subclasses just need to obey the contract declared by superclass. This can be accomplished, in the worst case, by making everything synchronized.
Make the default for loop potentially parallel and have the compiler complain if it cannot prove that, by either code inspection or, as a last resort, an explicit annotation on the loop or the methods it calls. Then introduce an sfor keyword for when you really have to make things sequential.
Re: (Score:2)
IMHO there should be 3 basic types of loops: FOR, PAR, SEQ. FOR loops can be parallelized if possible by the compiler, PAR loops have parallel semantics and SEQ loops have sequential semantics.
Then most loops can use the FOR variant but when advantageous the programmer can use PAR or SEQ depending on the situation and/or to help the compiler and improve readability.
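Java has no FOR/PAR/SEQ keywords, but as a rough sketch of the proposed distinction, its streams already expose the two explicit cases; only the "compiler's choice" variant is missing:

```java
import java.util.stream.IntStream;

public class LoopFlavors {
    // "SEQ": explicitly sequential; iteration order is guaranteed.
    static int seqSum(int n) {
        return IntStream.range(0, n).sum();
    }

    // "PAR": explicitly parallel semantics; the body must be order-independent.
    static int parSum(int n) {
        return IntStream.range(0, n).parallel().sum();
    }

    // "FOR" (let the compiler/runtime decide) has no direct Java equivalent;
    // the closest practice is writing order-independent bodies so that
    // either of the two forms above is a valid implementation.

    public static void main(String[] args) {
        System.out.println(seqSum(10)); // 45
        System.out.println(parSum(10)); // 45
    }
}
```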
Re: (Score:2)
Slapping synchronized on a method doesn't solve all your multi-threaded problems.
The simplest example is probably a Hashtable, all methods are synchronized.
If you want to replace a value in the map, calling get, then set is effectively a read-modify-write, that operation needs to be protected by enclosing it all in a synchronized block.
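A minimal Java sketch of that check-then-act race and two common fixes (names are illustrative):

```java
import java.util.Hashtable;
import java.util.concurrent.ConcurrentHashMap;

public class ReadModifyWrite {
    // Broken under concurrency: get() and put() are each synchronized,
    // but the gap between them is not, so two threads can both read the
    // old value and one increment gets lost.
    static void incrementRacy(Hashtable<String, Integer> map, String key) {
        Integer v = map.get(key);
        map.put(key, v == null ? 1 : v + 1);
    }

    // Fix 1: hold the map's own lock across the whole read-modify-write.
    static void incrementLocked(Hashtable<String, Integer> map, String key) {
        synchronized (map) {
            Integer v = map.get(key);
            map.put(key, v == null ? 1 : v + 1);
        }
    }

    // Fix 2: use an atomic compound operation from ConcurrentHashMap.
    static int countConcurrently(int threads, int perThread) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++)
                    counts.merge("hits", 1, Integer::sum); // atomic RMW
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counts.get("hits");
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 10_000)); // 40000, no lost updates
    }
}
```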
You'll also risk dead locks.
If Object A's synchronized method calls Object B's synchronized method and vice versa, having two threads calling both those methods risks a dead
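The classic way out of that lock-inversion trap is a global lock order; here is a hedged sketch with a hypothetical Account type:

```java
public class LockOrdering {
    // Hypothetical type for illustration.
    static class Account {
        final int id;
        long balance;
        Account(int id, long balance) { this.id = id; this.balance = balance; }
    }

    // Acquire locks in a global order (lowest id first), so two transfers
    // in opposite directions can never each hold one lock while waiting
    // for the other. Without the ordering, A->B and B->A can deadlock.
    static void transfer(Account from, Account to, long amount) {
        Account first = from.id < to.id ? from : to;
        Account second = (first == from) ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    // Hammer both directions at once; returns the combined balance.
    static long totalAfterCrossTransfers(int iterations) {
        Account a = new Account(1, 1000), b = new Account(2, 1000);
        Thread t1 = new Thread(() -> { for (int i = 0; i < iterations; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < iterations; i++) transfer(b, a, 1); });
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        return a.balance + b.balance;
    }

    public static void main(String[] args) {
        System.out.println(totalAfterCrossTransfers(10_000)); // 2000: money conserved, no deadlock
    }
}
```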
Re: (Score:2)
It's not just the programming languages. Most *tasks* of any complexity tend to be highly sequential in nature. There are some rare exceptions, but the notion that a language can just automatically parallelize loops and get some massive speedup is not very feasible, I hate to say. It tends to work best in highly contrived or specialized situations. You have to be running some serious computation in a LOT of loops for that to pay off in any way, or the overhead is simply a non-starter. Moreover, those c
Re: (Score:3)
Really? Every browser window or tab should hang if Javascript in one of them is slow? Loading and decoding an image for one of the icons on screen should prevent the UI from processing touch events? Many of those problems have been solved by important applications ad-hoc, but sane behaviour by default would be great.
Re: (Score:2)
A web browser is somewhat of a unique case, as each tab is more or less equivalent to a separate application - or at least, it should be. Chrome certainly proved it can effectively be done that way, I think. I agree - a single page should not be able to slow down the entire browser - that's terrible design.
When I was talking about UI, I was talking more like .NET's WPF or Qt, perhaps, neither of which are thread-safe because of performance concerns. It's critical for the programmer to do the work of hand
Re: (Score:2)
Basically every Java program I have ever seen is multithreaded (that only excludes hello-world programs and the like) ... the first one is a particularly bad example.
No idea why you explicitly mentioned Java and C#
Re: (Score:2)
I can understand how threading can be hard for systems that are very latency-sensitive, like 3D games, but for anything that just needs scaling and throughput, threading is brain-dead easy.
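As a rough illustration of the "throughput is easy" claim (a sketch with hypothetical task contents): independent tasks submitted to a fixed-size pool need no manual locking at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThroughputPool {
    // Independent tasks, no shared mutable state: submit them all to a
    // pool sized to the core count and collect the results via Futures.
    static long sumOfSquares(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final long x = i;
                results.add(pool.submit(() -> x * x)); // stand-in for real work
            }
            long sum = 0;
            for (Future<Long> f : results) sum += f.get();
            return sum;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(100)); // 328350
    }
}
```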
Wrong specs on Skylake (Score:4, Informative)
Re: (Score:2)
Re: (Score:2)
Underpowered, but for most people that would be the most powerful GPU they ever had on a PC.
You need more for games, but we're long past diminishing returns, and as time passes you get to use uglier versions of Windows. Games are also all tied to "app stores" like Steam and others, so you can't sell or lend them, and the store spies on you. Boring.
I can see myself upgrading to the next generation of APUs (better CPU, ddr4 support, supported by the new free driver + non-free blob architecture for linux) and
Re: (Score:2)
However, for what AMD is selling it for, it's a better deal than the comparably priced Intel chip, especially if you're doing anything that can make use of all of those cores.
Re: (Score:2)
5.5? My 7850K replaced an Intel P4 @ 3.1. At 5GHz I would need liquid nitrogen cooling. Not worth it, IMO. I'll take a high-end Xeon over any Skylake or other i7 flagship any day, all day. Thanks for pushing my envelope and opening my eyes.
You missed the point...
A 2.4GHz Core2Duo would crush that P4 outright, even in single thread work.
A P4 955 EE running at 3.64GHz (dual core) is outright crushed by a 3.3GHz Core2Duo, sometimes by a factor of 4.
Clock speed is just one measure of performance. A Skylake 6700K is, even at stock speeds, generally faster than anything AMD makes, even overclocked.
400 Thread Count (Score:1)
C'mon Intel, everyone knows that 400 thread count is the minimum needed for a good night's sleep.
Re: (Score:1)
Nope, 400 still isn't enough. A lot of us are still waiting for a CPU with over 9000 threads.
Re: (Score:2)
Pfft... 1500 count Egyptian, or I'm going home!
(No, not really. I don't actually know what the thread count is. One, I'm in a hotel STILL and, two, I don't actually buy my own bedding at home.)
AMD's response? (Score:4, Interesting)
Assuming Intel doesn't go Xeon-scale in pricing for this CPU (who am I kidding, of course they will) I wonder how AMD plans to respond to this.
For now, they've got the consoles holding them afloat. And while I am an AMD fan, I see they are rapidly losing out on the desktop space when it comes to performance (despite both companies having rather meager performance gains for the past several years.)
They'd better figure out what the fuck they're doing, and come up with some competing responses, quickly. Hell, I've got ideas for them, all involving that HBM tech.
1. Use a modified version of that HBM tech to stack their CPU cores and load it up with tons of cache memory (for their non-APU line.) And don't forget to drop a process node, for fuck's sake.
2. Use modified HBM tech to create stacked CPU/GPU/RAM/CACHE on the same die (for their APU line.)
3. Use modified HBM to create stacked single-die CrossFire GPUs that don't consume gobs of power (GPU line.)
4. Use modified HBM tech to create a true monolithic SoC package that integrates EVERYTHING, thus eliminating the need for motherboards - at that point, the motherboard just becomes a breakout board with a socket. They could probably do away with the interposer as well if they were clever enough in the design.
Re:AMD's response? (Score:5, Informative)
Re: (Score:2)
"You can't stack CPU cores because they output waaaaaay too much heat."
Microfluidic cooling to an IHS. Easy-peasy.
Re:AMD's response? (Score:4, Insightful)
Being able to describe something in five words does not make it easy.
Re: (Score:2)
It is that easy; we're already doing it to make large linear high-power LEDs run very cool.
Protip for AMD: Start with a thicker wafer, do the underside with your required cooling channels, do the topside with your typical litho. You CAN wick heat from the underside.
Re: (Score:2)
Assuming Intel doesn't go Xeon-scale in pricing for this CPU (who am I kidding, of course they will) I wonder how AMD plans to respond to this.
Fighting the battles they can win, or are at least less likely to lose. This is a halo product of a server line of chips and getting Opterons back in the data center takes more time for validation and convincing conservative enterprises than AMD has. Zen will launch to compete with Intel's mainstream dual/quad-core chips, even if it pulls off a miracle I'm guessing it'd take at least a year or two until AMD is back to a full top-to-bottom stack.
Re:AMD's response? (Score:5, Informative)
AMD has been developing a new microarchitecture, Zen, which will replace the horribly-designed Bulldozer. It's rumored to be made on a 14nm node, and they re-hired the guy who designed the K10 architecture (aka the last good CPUs AMD made), so I expect it to be reasonably competitive with Intel. I really hope it is, at least.
Your terminology is completely out of whack ("stacked single-die CrossFire GPU" is a phrase with more contradictions than whitespace characters), but I'll analyze what you were trying to say instead of what you actually said:
#1. Current chip-stacking tech doesn't allow for all that much bandwidth between chips, especially when going above two layers. CPU cores need a pretty hefty amount of bandwidth to their cache, so that's already problematic. Stacking dies also limits thermal performance - if you stack two dies, you have 2x the heat in 1x the heat-conducting surface area. For low-power stuff, that's fine, but CPU cores get pretty hot. Many high-performance dies are already performance-constrained by how much heat they can conduct to their cooler.
#2. This is a good idea. Or rather, the good idea is "APU on an interposer using HBM for main memory". You'd need bigger CPU caches; HBM is ridiculously high-latency even by VRAM standards, and it will really hurt CPU performance otherwise. And it will limit upgradability - no way to just pop another DIMM of DDR3 in there. But the GPU gains should be worth it.
#3. Again, thermals will absolutely prevent you from stacking GPU dies. HBM and stacking doesn't do ANYTHING for the power efficiency of the chips you're stacking, so that's two 100W+ dies on top of each other. Not gonna happen. You could stack them side-by-side on an interposer, but at that point why not just fabricate them as one die?
#4. The cost of an interposer is significantly greater than that of a printed circuit board, and a lot of stuff won't benefit from the greater bandwidth to the CPU - stuff like a USB controller or audio chipset. Stacking the dies is also more expensive than just using a PCB - it's done in phones where space is REALLY constrained, but even the smallest desktops aren't that tight for space yet. So all that's left is putting everything onto one die - which runs into yield problems, because with bigger individual dies, a single defect will wipe out a lot more silicon. AMD actually *is* already doing this with their lowest-end laptop/desktop parts - look at Socket AM1, there's not much on the motherboard besides external connectors and power-delivery circuits. But they're also pretty low-end in performance.
Re: (Score:3)
We have microfluidics for stacking dies and removing heat. We do it on p-n junctions on some of the latest LEDs (which are fucking MASSIVE at nearly 7mm x 7mm on just the die alone, not including any mount, circuitry, etc.) to keep them very cool.
I don't speak of ideas unless I already know we've got the technology to handle it.
Re: (Score:2)
49mm^2 is "massive"? A high-end processor is 500-600mm^2. And even if microfluidics works to remove heat (how do you have a layer with both enough fluid channels to cool, and enough TSVs for communication?), that will increase your cost substantially. I would expect $1K+ for a quad-core CPU under this kind of design.
Re: (Score:2)
Not very long ago I got a quote for an 80-thread Intel machine (4 Xeons) which turned out to be slightly less than ten times the price of a 64-core AMD machine with the same clock speed, memory capacity, disks, etc. The Xeon machine would perform better than a singl
Into the Wayback Machine Sherman! (Score:2)
Imagine a Beowulf Cluster of these!
Re: (Score:2)
Imagine a Beowulf Cluster of these!
No need ... give it a while and people will be saying "64 Cores ought to be enough for anyone." Then again, GPUs passed that count long ago.
Re: (Score:2)
Q: What is NVIDIA Tesla?
With the world’s first teraflop many-core processor, NVIDIA® Tesla computing solutions enable the necessary transition to energy efficient parallel computing power. With thousands of CUDA cores per processor, Tesla scales to solve the world’s most important computing challenges—quickly and accurately.
One example [nvidia.ca] Tesla K40: 2880 CUDA cores. That's a LOT of cores.
Next step: Send consultants to MySQL. (Score:3)
"Hi. We're from Intel, and we'd like to take a look at your multithreading, such as it is."
Re: (Score:2)
Waste of time...
Send them to pg, make it even better
This is all well and good... (Score:2)
Nice, but... (Score:2)
Re:Nice, but... (Score:4, Funny)
If you don't have at least 10 cores, how can you expect to run the ads, tracking software and gratuitous animations required to fully participate in the online society of the late 20-teens?
Intel often communicates poorly. (Score:2)
Broadwell was a "Tick". Skylake is the improvement called "Tock".
Re: Intel often communicates poorly. (Score:4, Interesting)
"Intel has made a habit of launching enthusiast versions of previous generations processors after it releases it a new architecture."
What does that say about Intel? (Score:2)
Re: (Score:2)
Yes, but why?
The enthusiast desktop CPUs are a spin-off from the Xeon server CPUs. They spend much longer time validating those than mainstream laptop/desktop CPUs, so on any new architecture/process they're likely to arrive last. The upside and/or downside is that it might still introduce new features or standards like say DDR4 ahead of the consumer CPUs, but sometimes at a high cost.
Interesting. (Score:2)
Re: (Score:2)
As the line ages, yields improve and they generally iterate over the design in smaller ways to obtain even better efficiency or iron out issues. It's at that point that it becomes ve
Typical Intel confusion (Score:2)
"(so Broadwell-E is 6000 like Skylake processors)"
That, to me, seems like Intel being typically Intel. That creates confusion, instead of communicating clearly.
A long time ago, I wanted to order some Intel motherboards. I needed the part numbers. It required 2 hours to get the numbers.
Several years ago, I mentioned an error in the Intel web site to an Intel customer service employee. He said, "Oh, we are re-doing our web site." A year later, I happened to get the same
VMWare whitebox heaven (Score:3)
This will be nice to pop into a whitebox VMWare ESXi machine. Definitely cheaper than a 2 x 6 core build.
Re: (Score:2)
If only they would pair it with a desktop board that could take 256 GB RAM.
I find that I eat all my disk i/o and RAM way before my cpu.
Re: (Score:2)
Yeah that's part of the problem, but for some of our dev workloads we only use 2GB of RAM per VM but hammer the processor so this is a good niche fit. And Gigabyte's got some workstation boards that go to 64GB but also cost more so it's a trade off - and it's not a sure thing they'll support these chips. Obviously it's not for everyone.
Re: (Score:2)
There are desktop boards that support a theoretical 512GB or 768GB memory, if you go with registered ddr4.
Look for the "pro" chipset, C612.
Needs a Xeon E5-1xxx - the leading 1 says it works only in single CPU mode - which is about the same as an i7 anyway.
Speed fallacies. (Score:2)
"it could potentially trail behind the Core i7-6700K, a quad-core Skylake processor clocked at 3.4GHz (base) to 4GHz (Turbo)."
Not by much.
If you want to see the true speed of any CPU, look at the memory speed. Internal multipliers make some steps run faster, but the overall effect isn't high enough to justify the cost deltas on the higher-clockrate CPUs. In general the sweet spot is 2-4 steps below the top step.
If you have a proper multitasking operating system it will take as much advantage of extra processo
Gillette (Score:2)
I hear it is used to power the new Gillette razor with 6 blades...
Re: 20 cores DOES matter (Score:2)
How is your SSD speed and thermal envelope doing with make -j20?
Re: (Score:1)
Good question - it might end up clamped by other things, but at least up to -j 8, it scales decently well on a current CPU. It's still a big win over -j 4, say.
Compilation has a lot of non-local data access that pays a heavy price for cache misses, so it's not so much bandwidth-sensitive as latency-sensitive. During those cache-miss latency periods, other cores can still be doing something.
It might be that 20 is too much to hope for, I don't know. I'm pretty sure it'll scale beyond 8 though.
Re: 20 cores DOES matter (Score:5, Interesting)
Actually, parallel builds barely touch the storage subsystem. Everything is basically cached in ram and writes to files wind up being aggregated into relatively small bursts. So the drives are generally almost entirely idle the whole time.
It's almost a pure-cpu exercise and also does a pretty good job testing concurrency within the kernel due to the fork/exec/run/exit load (particularly for Makefile-based builds which use /bin/sh a lot). I've seen fork/exec rates in excess of 5000 forks/sec during poudriere runs, for example.
-Matt
Re: 20 cores DOES matter (Score:5, Informative)
Urm. And you've investigated this and found that your drive is pegged because... of what? Or you haven't investigated this and you have no idea why your drive is pegged. I'll take a guess... you are running out of memory and the disk activity you see is heavy paging.
Let me rephrase... we do bulk builds with poudriere of 20,000 applications. It takes a bit less than two days. We set the parallelism to roughly 2x the number of CPU threads available. There are usually several hundred processes active in various states at any given moment. The CPU load is pegged. Disk activity is zero for most of the time.
If I do something less strenuous, like a buildworld or buildkernel, the result is almost the same: CPU is mostly pegged, disk activity is zero for the roughly 30 minutes the buildworld takes. However, smaller builds such as a buildworld or buildkernel, or a Linux kernel build, regardless of the -j concurrency you specify, will certainly have bottlenecks in the build system that have nothing to do with the CPU. A little work on the Makefiles will solve that problem. In our case there are always two or three ridiculously huge source files in the GCC build that make has to wait for before it can proceed with the link pass. Similarly, a kernel build has a make depend step at the beginning which is not parallelized, and a final link at the end which cannot be parallelized, and those actually take most of the time. Compiling the sources in the middle finishes in a flash.
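Those non-parallelizable link and depend steps are Amdahl's law in action; with made-up numbers, a quick sketch shows why adding cores stops helping such builds:

```java
public class Amdahl {
    // Amdahl's law: speedup = 1 / (serial + (1 - serial) / cores),
    // where "serial" is the fraction of the job that cannot be parallelized.
    static double speedup(double serialFraction, int cores) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / cores);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: if the final link pass is 10% of a build,
        // 20 cores give well under 20x, and no core count beats 10x.
        System.out.println(speedup(0.10, 20));   // ~6.9
        System.out.println(speedup(0.10, 1000)); // ~9.9, approaching the 10x ceiling
    }
}
```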
But your problem sounds a bit different... kinda sounds like you are running yourself out of memory. Parallel builds can run machines out of memory if the dev specifies more concurrency than his memory can handle. For example, when building packages there are many C++ source files which #include the kitchen sink and wind up with process run sizes north of 1GB. If someone only has 8GB of ram and tries a -j 8 build under those circumstances, that person will run out of memory and start to page heavily.
So it's a good idea to look at the footprint of the individual processes you are trying to parallelize, too.
Memory is cheap these days. Buy more. Even those tiny little BRIX one can get these days can hold 32GB of RAM. For a decent concurrent build on a decent CPU you want 8GB minimum; 16GB or more is better.
-Matt
Re: (Score:2)
Let me rephrase... we do bulk builds with poudriere of 20,000 applications. It takes a bit less than two days. We set the parallelism to roughly 2x the number of CPU threads available. There are usually several hundred processes active in various states at any given moment. The CPU load is pegged. Disk activity is zero for most of the time.
Have you ever considered using sar to check the amount of minor page faulting going on? It would be interesting to measure the activity between L3 cache and memory; it's possible that memory is thrashing as the CPU scheduler attempts to divide time between logical cores that are actually the same physical core.
My applications are messaging systems, so they aren't transient processes like a compile. Over 20,000 applications that is a lot of context switching, and from what you have described the amount of la
Re: (Score:2)
If we're talking about bulk builds, for any language, there is going to be a huge amount of locality of reference that matches well against caches. shared text RO, lots of shared files RO, stack use is localized (RW), process data is relatively localized (RW), and file writeouts are independent. Plus any decent scheduler will recognize the batch-like nature of the compile jobs and use relatively large switch ticks. For a bulk build the scheduler doesn't have to be very smart, it just needs to avoid movin
Re: (Score:2)
"Memory bandwidth can become an issue"
Bandwidth is almost never an issue. _Latency_ is another matter.
There are a lot of tricks and bits to optimise things regarding locality (mainly around row-based and lookahead accessing; CPUs aren't the only devices trying to predict what will be read next) and controller optimisation, but the underlying dynamic RAM itself hasn't actually improved much over the last 20 years in terms of the time between addressing a random cell and getting an answer back from it. The big
Re: (Score:2)
"Memory bandwidth can become an issue"
Bandwidth is almost never an issue. _Latency_ is another matter.
Thanks stoatwblr, that's exactly what I was talking about.
There are a lot of tricks and bits to optimise things regarding locality (mainly around row-based and lookahead accessing; CPUs aren't the only devices trying to predict what will be read next) and controller optimisation, but the underlying dynamic RAM itself hasn't actually improved much over the last 20 years in terms of the time between addressing a random cell and getting an answer back from it.
Which is why I'm always trying to tune the amount of application latency so the CPU cycles can be used for actual work. Obviously everyone's workloads are different but it makes sense to provide a bit of a helping hand to the machine (usually so I can go home)
The big improvements have been around the number of requests you can make while waiting for that answer instead of being in request-answer lockstep and there is only so far that can be taken.
That's an interesting development I hadn't heard of. That would have a major impact on reducing application latency. I think I'll have to get my head around how it will affect CPU scheduler beh
Re: (Score:2)
Even those tiny little BRIX one can get these days can hold 32G of ram.
Do you have a link for what you are talking about? My desktop has 32 GB at home, and it has a pretty shitty MB, but I am not sure what you mean by a BRIX.
I found this on Google:
http://www.gigabyte.us/product... [gigabyte.us]
But as far as I see, they only support 2 SoDIMM (DDR3L), which many of the specs pages list as 2x8GB max, so I don't know if this is what you are talking about.
Re: (Score:2)
Replying to undo incorrect moderation. Sorry!
Re: (Score:2)
Re: (Score:2)
Where do you buy your salt water fish? I have a taste for some sea bass, broiled with olives and capers.
Something fishy about this (Score:2)
I generally buy them from liveaquaria.com. For my aquariums, not for my dinner. :)
Re: (Score:2)
And they show up alive? I'm always a little amazed when people get live animals shipped to them. I have a friend who's an urban beekeeper who gets live bees fed-exed to them. What a country.
Re: (Score:2)
You can get day-old chicks thru the post office [usps.gov]. Your friend's bees are listed at the top of that same page.
Re: (Score:2)
Yes, they show up alive and doing quite well. They pack them in plastic bags, some of which have black light-shields for the species that are prone to shock from sudden changes in light intensity, all inside said thermal envelope (a Styrofoam cooler, essentially.) They put a heating or cooling chemical packet in there with them, depending on the season, and then ship them overnight by FedEx or UPS. I unpack them immediately upon receipt, gradually acclimate them to the water and temperature they'll be livin
Re: (Score:2)
Very cool.
Re: (Score:2)
Is the CPU cooler able to keep up with the CPU doing that much work, or is the CPU forced to throttle back to prevent overheating?
Re: (Score:2)
You can pick up hex core boxes dirt cheap on eBay now. Look for the X5650 or W3680 Xeon boxes.
Re: (Score:1)
Both of which have such slow single threaded performance that most people are better off with an i3.
I have a W3680 box that runs 8 VDI boxes through Citrix XenDesktop and the performance is pretty good for that purpose, but for gaming it's 25-30% slower than the dual-core i3-4160 in my cheapy shit gaming box
Re: (Score:2)
Interesting link. My Lenovo box has an X58 motherboard, but I don't think the stock BIOS will allow for much tweaking. This was still a huge speed boost from my old Q6600 box.
Re: (Score:1)
You think your CPU isn't good enough? I'm still using a Core 2 Duo here!
Re: (Score:2)
Video compression. I have tested faster speed CPUs with fewer cores against slower speed with more cores - more cores won. I'm VERY interested in this but suspect the 8 core overclocked might be the fiscally responsible way to go. I'd use a XEON but you cannot overclock them...
I do have an ESX server but I've never been able to find a good compression appliance to use. Something I could use with say a web front-end to upload and just settings for ffmpeg would rock. Anyone?
Re: (Score:2)
I would love a video conversion appliance. I want to pull movies off the Tivo and autoconvert them to MP4, as well as any movies in my collection in random video formats do the same. That would be very nice to have something that I could throw on the ESX box to do all that work with some minimal configuration.
Re: (Score:2)
Re:20 cores DOES matter (Score:5, Interesting)
I was under the (admittedly vague) impression that was true only if the thread was using floating point.
CPUs that offer more cores and/or threads than they do FPUs is one of the reasons I write a lot of my multi-threaded stuff (image and baseband RF processing) utilizing appropriately scaled integer math.
I have 8 cores with 8 FPUs on my desk, but many of my users are stuck with some of the wheezier i5 variants.
Re: (Score:1)
As a side benefit, "appropriately scaled integer math" can be better in other ways too, depending on the nature of your problem. It doesn't pile all its precision up right near zero, but gives you an even distribution of the precision across the range. That might (or might not) be a much better fit for your problem, depending on what you're doing.
Re: (Score:3)
Yep. Although I have to say, I really enjoy working with the domain 0.0-1.0; there are so many neat tricks that can be pulled. You can do them in integer too, but there is hoop-jumping involved.
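A minimal sketch of what "appropriately scaled integer math" can look like: Q16 fixed point over [0, 1] (the format choice is illustrative, not what the poster necessarily uses).

```java
public class FixedPoint {
    // Q16 fixed point: a value x in [0, 1] is stored as round(x * 65536).
    static final int ONE = 1 << 16;

    // Multiply two Q16 values; widen to long so the 32-bit intermediate
    // product cannot overflow before the shift renormalizes it.
    static int mul(int a, int b) {
        return (int) (((long) a * b) >> 16);
    }

    static double toDouble(int q) { return q / 65536.0; }

    public static void main(String[] args) {
        int half = ONE / 2, quarter = ONE / 4;
        System.out.println(toDouble(mul(half, half)));    // 0.25
        System.out.println(toDouble(mul(half, quarter))); // 0.125
    }
}
```

Unlike floating point, the precision here is spread evenly across the range rather than piled up near zero, which matches the sibling comment's point.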
Re: (Score:2)
I was under the (admittedly vague) impression that was true only if the thread was using floating point.
No, that is the AMD Bulldozer design. Hyper-Threading provides no extra CPU power for the additional threads; how much performance you get out of the extra threads depends on how much the threads are forced to stall on memory access. If they all execute integer/FP instructions and memory accesses that hit the cache, you get only 50% of normal performance out of each hyper-thread.
Re: (Score:2)
Re: (Score:3)
Apparently the last time you checked was in 2004. Hyper threading gets you a lot more than a 10% gain on modern CPUs.
Re:20 cores DOES matter (Score:4, Interesting)
It depends on the task. For double-precision FP calculations using MPI multi-processing (e.g. FORTRAN CFD), the extra overhead of the extra cores talking to each other mostly cancels out the gains.
For many many small short-lifetime processes you'll probably do better.
Re: (Score:2)
Use MPI BETWEEN nodes and OpenMP WITHIN a node. MPI is always going to be slower on a single node than threads are and for HPC type loads it can easily be 100x slower to use processes instead of threads due to communications overhead.
Re:20 cores DOES matter (Score:4, Informative)
You likely have not checked for a while. I saw figures of 120% performance ("each core at 60% performance" as you put it) back under the Pentium 4 HT, 140% under Nehalem/Sandy Bridge, and 150% under Haswell.
Re: (Score:2)
It is a complicated subject. Some tasks do not benefit from HT - those whose memory access fits entirely within cache, and who make use of operations that cannot be spread among execution units in a core (or where the pattern of operations is superscalar with a single thread).
Simultaneous multithreading (the non-trademark name for HT) offers benefits in certain situations. First, where the memory access pattern is unpredictable and/or uncachable - it essentially lets one thread keep the core working while t
Re:20 cores DOES matter (Score:5, Informative)
Hyperthreading on Intel gives about a +30 to +50% performance improvement. So each core winds up at about 1.3 to 1.5 times the performance with two threads versus 1.0 with one. Quite significant. It depends on the type of load, of course.
The main reason for the improvement is of course due to one thread being able to make good use of execution units while the other thread is stalled on something (like memory or TLB, significant integer shifts, or dependent Integer or FPU multiply and divide operations).
-Matt
Re: (Score:2)
Re: (Score:2)
Workload and system balance, mostly.
If you look back several years (2008? earlier?) you'll see some Sun Sparc designs, and some IBM POWER designs, that supported 4 or 8 threads per core. They worked well for very specific workloads and applications.
The Sun Sparc designs with 8 threads per core were mostly tailored for "simple" highly-scalable web servers,
Re: (Score:2)
The down side of SMT is that you incre
Re: (Score:2)
Memory isn't neccesarily the bottleneck. (Score:2)
It's not that straight-forward. Depends on how much time the threads spend in the cache, and how much time they spend waiting on the FPU.
Re: (Score:2)
Re: Sounds like fun (Score:1)
Thank you for sharing. I was lost without your insightful comment!
Re: (Score:2)
Don't worry. Windows will bloat to fill the additional cores.
Re: (Score:2)
They did in 1990. Oh wait, that subject was run by an electrical engineering department so it may take a while for others to catch up.
Re: (Score:2)
you would think that they do, since it's essential to making anything in Java or Android or just about anything else except JavaScript nowadays.
anyhow.. our multi-thread parallel programming course was in Ada.. it was in Ada so that we could use the language's built-in mechanisms and skip learning anything. I guess there was some rationale, but it was probably the same rationale that was used for teaching microcontroller programming with assembly on some Hitachi POS.
Re: (Score:2)
"a whopping 25MB of L3 cache" souds like "a whopping 20Mb capacity" for a brand new hard disk in the early 1990's...
*Whooosh*