Intel Flagship Core i7-6950X Broadwell-E To Offer 10-Cores, 20-Threads, 25MB L3 (hothardware.com) 167
MojoKid writes: Intel has made a habit of launching enthusiast versions of previous-generation processors after it releases a new architecture. As was the case with Intel's Haswell architecture, high-end Broadwell-E variants are expected, and it looks like Intel is readying a doozy. Recently revealed details show four new processors under the HEDT (High-End Desktop) banner for Broadwell, one more SKU than Haswell-E brought to the table. The most intriguing of the new chips is the Core i7-6950X, a monster 10-core CPU with Hyper-Threading support. That gives the Core i7-6950X 20 threads to play with, along with a whopping 25MB of L3 cache. The caveat is the CPU's clock speed: it will run at just 3.0GHz (base), so in applications that aren't properly tuned to take advantage of large core and thread counts, it could potentially trail the Core i7-6700K, a quad-core Skylake processor clocked at 3.4GHz (base) to 4GHz (Turbo).
Software needs to catch up (Score:2)
Mainstream programming languages are still sequential by default, and the likes of OpenCL are too hard to learn for simple tasks. UI code is still single-threaded in most systems, and through programmer laziness it drags most computation onto that thread as well. It's time for languages that are parallel by default and where the ability to parallelize a loop is verifiable at compile time. Yes, I know FORTRAN is much closer to that than C/Java, but that's due to being primitive to a degree that will not fly in
Re: (Score:2)
Julia and Rust have some intriguing parallelisation mechanisms.
What I would like to write is code that has a dependency graph (or the compiler figures out the dependencies and parallelises by itself).
In the meantime I simply write code for 1 processor, and run that on many data sets in parallel (using make or doit [pydoit.org]).
Re: (Score:2)
Julia and Rust have some intriguing parallelisation mechanisms. [...] I simply write code for 1 processor, and run that on many data sets in parallel
I haven't found Julia's parallelism very efficient, but maybe it's just my lack of coding skills. Then again my work [algoristo.com] is rather parallel by nature (independent pixels), so I simply run several processes using shell scripts.
For example, it would be nice if something like map() were always parallelized as it kind of assumes independent data points, but there are still other considerations like memory management. Julia's pmap() seems to have too much overhead to be of any help, especially when the separate p
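The pmap() overhead point carries over to Java's parallel streams as well; as a hedged sketch (Java rather than Julia, with hypothetical method names), parallelizing a map only pays when the per-element work dwarfs the task-splitting and result-merging cost:

```java
import java.util.stream.IntStream;

public class ParallelMapOverhead {
    // The same map-reduce, run sequentially or in parallel. For work this
    // cheap the parallel version usually loses: splitting the range into
    // tasks and merging results costs more than the arithmetic saves.
    static long sum(int n, boolean parallel) {
        IntStream s = IntStream.range(0, n);
        if (parallel) s = s.parallel();
        return s.mapToLong(i -> i + 1).sum();
    }

    public static void main(String[] args) {
        // The answer is identical either way; only wall-clock time differs.
        System.out.println(sum(1_000, false)); // 500500
        System.out.println(sum(1_000, true));  // 500500
    }
}
```

Timing the two variants on both tiny and compute-heavy element work is the quickest way to find where the crossover sits on a given machine.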
Re: (Score:2)
Java makes it pretty easy to write multi-threaded code.
Re: (Score:2)
Multithreading requires every instance of concurrent execution to be micromanaged by the programmer, which leads to a lot of code that is never parallelized in practice. Potential concurrency should be the default case for, say, all for loops, with sequential execution an explicit choice the programmer is aware of, coupled with strong compile-time checking that can tell safe code from unsafe.
Re: (Score:2)
Here's an example from Oracle:

    double average = roster
        .parallelStream()
        .filter(p -> p.getGender() == Person.Sex.MALE)
        .mapToInt(Person::getAge)
        .average()
        .getAsDouble();
Lambdas and extensions to the Collections framework have made parallel loops simple.
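For completeness, here is a self-contained, runnable version of that Oracle snippet. The Person class is a stand-in reconstructed for illustration, not Oracle's actual tutorial class:

```java
import java.util.List;

public class RosterAverage {
    // Hypothetical stand-in for the Person class assumed by the snippet.
    static class Person {
        enum Sex { MALE, FEMALE }
        private final String name;
        private final Sex gender;
        private final int age;
        Person(String name, Sex gender, int age) {
            this.name = name; this.gender = gender; this.age = age;
        }
        Sex getGender() { return gender; }
        int getAge() { return age; }
    }

    static double averageMaleAge(List<Person> roster) {
        // Identical pipeline to the quoted example: the only "parallel"
        // part is swapping stream() for parallelStream().
        return roster.parallelStream()
                .filter(p -> p.getGender() == Person.Sex.MALE)
                .mapToInt(Person::getAge)
                .average()
                .getAsDouble();
    }

    public static void main(String[] args) {
        List<Person> roster = List.of(
                new Person("Ada", Person.Sex.FEMALE, 36),
                new Person("Bob", Person.Sex.MALE, 20),
                new Person("Carl", Person.Sex.MALE, 40));
        System.out.println(averageMaleAge(roster)); // 30.0
    }
}
```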
You can't have the compiler parallelize loops if the methods you call can be overridden. Forcing every method called to be final, just so you can optimize some loop, is a little daft. It's also pointless (read: slower) parallelizing a loop that only runs
Re: (Score:2)
Nothing has to be made final, subclasses just need to obey the contract declared by superclass. This can be accomplished, in the worst case, by making everything synchronized.
Make the default for loop potentially parallel and have the compiler complain if it cannot prove that, by either code inspection or, as a last resort, an explicit annotation on the loop or the methods it calls. Then introduce an sfor keyword for when you really have to make things sequential.
Re: (Score:2)
IMHO there should be 3 basic types of loops: FOR, PAR, SEQ. FOR loops can be parallelized if possible by the compiler, PAR loops have parallel semantics and SEQ loops have sequential semantics.
Then most loops can use the FOR variant but when advantageous the programmer can use PAR or SEQ depending on the situation and/or to help the compiler and improve readability.
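Java has no FOR/PAR/SEQ keywords, but as a rough sketch of the proposed distinction, its streams already expose the two explicit cases; only the "compiler's choice" variant is missing:

```java
import java.util.stream.IntStream;

public class LoopFlavors {
    // "SEQ": explicitly sequential; iteration order is guaranteed.
    static int seqSum(int n) {
        return IntStream.range(0, n).sum();
    }

    // "PAR": explicitly parallel semantics; the body must be order-independent.
    static int parSum(int n) {
        return IntStream.range(0, n).parallel().sum();
    }

    // "FOR" (let the compiler/runtime decide) has no direct Java equivalent;
    // the closest practice is writing order-independent bodies so that
    // either of the two forms above is a valid implementation.

    public static void main(String[] args) {
        System.out.println(seqSum(10)); // 45
        System.out.println(parSum(10)); // 45
    }
}
```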
Re: (Score:2)
Slapping synchronized on a method doesn't solve all your multi-threaded problems.
The simplest example is probably a Hashtable, all methods are synchronized.
If you want to replace a value in the map, calling get, then set is effectively a read-modify-write, that operation needs to be protected by enclosing it all in a synchronized block.
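A minimal Java sketch of that check-then-act race and two common fixes (names are illustrative):

```java
import java.util.Hashtable;
import java.util.concurrent.ConcurrentHashMap;

public class ReadModifyWrite {
    // Broken under concurrency: get() and put() are each synchronized,
    // but the gap between them is not, so two threads can both read the
    // old value and one increment gets lost.
    static void incrementRacy(Hashtable<String, Integer> map, String key) {
        Integer v = map.get(key);
        map.put(key, v == null ? 1 : v + 1);
    }

    // Fix 1: hold the map's own lock across the whole read-modify-write.
    static void incrementLocked(Hashtable<String, Integer> map, String key) {
        synchronized (map) {
            Integer v = map.get(key);
            map.put(key, v == null ? 1 : v + 1);
        }
    }

    // Fix 2: use an atomic compound operation from ConcurrentHashMap.
    static int countConcurrently(int threads, int perThread) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++)
                    counts.merge("hits", 1, Integer::sum); // atomic RMW
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counts.get("hits");
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 10_000)); // 40000, no lost updates
    }
}
```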
You'll also risk dead locks.
If Object A's synchronized method calls Object B's synchronized method and vice versa, having two threads calling both those methods risks a dead
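The classic way out of that lock-inversion trap is a global lock order; here is a hedged sketch with a hypothetical Account type:

```java
public class LockOrdering {
    // Hypothetical type for illustration.
    static class Account {
        final int id;
        long balance;
        Account(int id, long balance) { this.id = id; this.balance = balance; }
    }

    // Acquire locks in a global order (lowest id first), so two transfers
    // in opposite directions can never each hold one lock while waiting
    // for the other. Without the ordering, A->B and B->A can deadlock.
    static void transfer(Account from, Account to, long amount) {
        Account first = from.id < to.id ? from : to;
        Account second = (first == from) ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    // Hammer both directions at once; returns the combined balance.
    static long totalAfterCrossTransfers(int iterations) {
        Account a = new Account(1, 1000), b = new Account(2, 1000);
        Thread t1 = new Thread(() -> { for (int i = 0; i < iterations; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < iterations; i++) transfer(b, a, 1); });
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        return a.balance + b.balance;
    }

    public static void main(String[] args) {
        System.out.println(totalAfterCrossTransfers(10_000)); // 2000: money conserved, no deadlock
    }
}
```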
Re: (Score:2)
It's not just the programming languages. Most *tasks* of any complexity tend to be highly sequential in nature. There are some rare exceptions, but the notion that a language can just automatically parallelize loops and get some massive speedup is not very feasible, I hate to say. It tends to work best in highly contrived or specialized situations. You have to be running some serious computation in a LOT of loops for that to pay off in any way, or the overhead is simply a non-starter. Moreover, those c
Re: (Score:3)
Really? Every browser window or tab should hang if Javascript in one of them is slow? Loading and decoding an image for one of the icons on screen should prevent the UI from processing touch events? Many of those problems have been solved by important applications ad-hoc, but sane behaviour by default would be great.
Re: (Score:2)
A web browser is somewhat of a unique case, as each tab is more or less equivalent to a separate application - or at least, it should be. Chrome certainly proved it can effectively be done that way, I think. I agree - a single page should not be able to slow down the entire browser - that's terrible design.
When I was talking about UI, I was talking more like .NET's WPF or Qt, perhaps, neither of which are thread-safe because of performance concerns. It's critical for the programmer to do the work of hand
Re: (Score:2)
Basically every Java program I have ever seen is multithreaded (that only excludes hello-world programs and the like) ... the first one is a particularly bad example.
No idea why you explicitly mentioned Java and C#
Re: (Score:2)
I can understand how threading can be hard for systems that are very latency-sensitive, like 3D games, but for anything that just needs scaling and throughput, threading is brain-dead easy.
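As a rough illustration of the "throughput is easy" claim (a sketch with hypothetical task contents): independent tasks submitted to a fixed-size pool need no manual locking at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThroughputPool {
    // Independent tasks, no shared mutable state: submit them all to a
    // pool sized to the core count and collect the results via Futures.
    static long sumOfSquares(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final long x = i;
                results.add(pool.submit(() -> x * x)); // stand-in for real work
            }
            long sum = 0;
            for (Future<Long> f : results) sum += f.get();
            return sum;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(100)); // 328350
    }
}
```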
Wrong specs on Skylake (Score:4, Informative)
Re: (Score:2)
Re: (Score:2)
Underpowered, but for most people that would be the most powerful GPU they ever had on a PC.
You need more for games, but we're long past diminishing returns, and as time passes you get to use uglier versions of Windows. Games are also all tied to "app stores" like Steam and others, so you can't sell or lend them, and the store spies on you. Boring.
I can see myself upgrading to the next generation of APUs (better CPU, ddr4 support, supported by the new free driver + non-free blob architecture for linux) and
Re: (Score:2)
However, for what AMD is selling it for, it's a better deal than the comparably priced Intel chip, especially if you're doing anything that can make use of all of those cores.
Re: (Score:2)
5.5? My 7850K replaced an Intel P4 @ 3.1. At 5GHz I would need liquid nitrogen cooling. Not worth it, IMO. I'll take a high-end Xeon over any Skylake or other i7 flagship any day, all day. Thanks for pushing my envelope and opening my eyes.
You missed the point...
A 2.4GHz Core2Duo would crush that P4 outright, even in single thread work.
A P4 955 EE running at 3.64GHz (dual core) is outright crushed by a 3.3GHz Core2Duo, sometimes by a factor of 4.
Clock speed is just one measure of performance. A Skylake 6700K is, even at stock speeds, generally faster than anything AMD makes, even overclocked.
400 Thread Count (Score:1)
C'mon Intel, everyone knows that 400 thread count is the minimum needed for a good night's sleep.
Re: (Score:1)
Nope, 400 still isn't enough. A lot of us are still waiting for a CPU with over 9000 threads.
Re: (Score:2)
Pfft... 1500 count Egyptian, or I'm going home!
(No, not really. I don't actually know what the thread count is. One, I'm in a hotel STILL and, two, I don't actually buy my own bedding at home.)
AMD's response? (Score:4, Interesting)
Assuming Intel doesn't go Xeon-scale in pricing for this CPU (who am I kidding, of course they will) I wonder how AMD plans to respond to this.
For now, they've got the consoles holding them afloat. And while I am an AMD fan, I see they are rapidly losing out on the desktop space when it comes to performance (despite both companies having rather meager performance gains for the past several years.)
They'd better figure out what the fuck they're doing, and come up with some competing responses, quickly. Hell, I've got ideas for them, all involving that HBM tech.
1. Use a modified version of that HBM tech to stack their CPU cores and load it up with tons of cache memory (for their non-APU line.) And don't forget to drop a process node, for fuck's sake.
2. Use modified HBM tech to create stacked CPU/GPU/RAM/CACHE on the same die (for their APU line.)
3. Use modified HBM to create stacked single-die CrossFire GPUs that don't consume gobs of power (GPU line.)
4. Use modified HBM tech to create a true monolithic SoC package that integrates EVERYTHING, thus eliminating the need for motherboards - at that point, the motherboard just becomes a breakout board with a socket. They could probably do away with the interposer as well if they were clever enough in the design.
Re:AMD's response? (Score:5, Informative)
Re: (Score:2)
"You can't stack CPU cores because they output waaaaaay too much heat."
Microfluidic cooling to an IHS. Easy-peasy.
Re:AMD's response? (Score:4, Insightful)
Being able to describe something in five words does not make it easy.
Re: (Score:2)
It is that easy; we're already doing it to make large linear high-power LEDs run very cool.
Protip for AMD: Start with a thicker wafer, do the underside with your required cooling channels, do the topside with your typical litho. You CAN wick heat from the underside.
Re: (Score:2)
Assuming Intel doesn't go Xeon-scale in pricing for this CPU (who am I kidding, of course they will) I wonder how AMD plans to respond to this.
Fighting the battles they can win, or are at least less likely to lose. This is a halo product of a server line of chips and getting Opterons back in the data center takes more time for validation and convincing conservative enterprises than AMD has. Zen will launch to compete with Intel's mainstream dual/quad-core chips, even if it pulls off a miracle I'm guessing it'd take at least a year or two until AMD is back to a full top-to-bottom stack.
Re:AMD's response? (Score:5, Informative)
AMD has been developing a new microarchitecture, Zen, which will replace the horribly-designed Bulldozer. It's rumored to be made on a 14nm node, and they re-hired the guy who designed the K10 architecture (aka the last good CPUs AMD made), so I expect it to be reasonably competitive with Intel. I really hope it is, at least.
Your terminology is completely out of whack ("stacked single-die CrossFire GPU" is a phrase with more contradictions than whitespace characters), but I'll analyze what you were trying to say instead of what you actually said:
#1. Current chip-stacking tech doesn't allow for all that much bandwidth between chips, especially when going above two layers. CPU cores need a pretty hefty amount of bandwidth to their cache, so that's already problematic. Stacking dies also limits thermal performance - if you stack two dies, you have 2x the heat in 1x the heat-conducting surface area. For low-power stuff, that's fine, but CPU cores get pretty hot. Many high-performance dies are already performance-constrained by how much heat they can conduct to their cooler.
#2. This is a good idea. Or rather, the good idea is "APU on an interposer using HBM for main memory". You'd need bigger CPU caches; HBM is ridiculously high-latency even by VRAM standards, and it will really hurt CPU performance otherwise. And it will limit upgradability - no way to just pop another DIMM of DDR3 in there. But the GPU gains should be worth it.
#3. Again, thermals will absolutely prevent you from stacking GPU dies. HBM and stacking doesn't do ANYTHING for the power efficiency of the chips you're stacking, so that's two 100W+ dies on top of each other. Not gonna happen. You could stack them side-by-side on an interposer, but at that point why not just fabricate them as one die?
#4. The cost of an interposer is significantly greater than that of a printed circuit board, and a lot of stuff won't benefit from the greater bandwidth to the CPU - stuff like a USB controller or audio chipset. Stacking the dies is also more expensive than just using a PCB - it's done in phones where space is REALLY constrained, but even the smallest desktops aren't that tight for space yet. So all that's left is putting everything onto one die - which runs into yield problems, because with bigger individual dies, a single defect will wipe out a lot more silicon. AMD actually *is* already doing this with their lowest-end laptop/desktop parts - look at Socket AM1, there's not much on the motherboard besides external connectors and power-delivery circuits. But they're also pretty low-end in performance.
Re: (Score:3)
We have microfluidics for stacking dies and removing heat. We do it on p-n junctions on some of the latest LEDs (which are fucking MASSIVE at nearly 7mm x 7mm on just the die alone, not including any mount, circuitry, etc.) to keep them very cool.
I don't speak of ideas unless I already know we've got the technology to handle it.
Re: (Score:2)
49mm^2 is "massive"? A high-end processor is 500-600mm^2. And even if microfluidics works to remove heat (how do you have a layer with both enough fluid channels to cool, and enough TSVs for communication?), that will increase your cost substantially. I would expect $1K+ for a quad-core CPU under this kind of design.
Re: (Score:2)
Not very long ago I got a quote for an 80-thread Intel machine (4 Xeons) which turned out to be slightly less than ten times the price of a 64-core AMD machine with the same clock speed, memory capacity, disks, etc. The Xeon machine would perform better than a singl
Into the Wayback Machine Sherman! (Score:2)
Imagine a Beowulf Cluster of these!
Re: (Score:2)
Imagine a Beowulf Cluster of these!
No need ... give it a while and people will be saying "64 Cores ought to be enough for anyone." Then again, GPUs passed that count long ago.
Re: (Score:2)
Q: What is NVIDIA Tesla?
With the world’s first teraflop many-core processor, NVIDIA® Tesla computing solutions enable the necessary transition to energy efficient parallel computing power. With thousands of CUDA cores per processor, Tesla scales to solve the world’s most important computing challenges—quickly and accurately.
One example [nvidia.ca] Tesla K40: 2880 CUDA cores. That's a LOT of cores.
Next step: Send consultants to MySQL. (Score:3)
"Hi. We're from Intel, and we'd like to take a look at your multithreading, such as it is."
Re: (Score:2)
Waste of time...
Send them to pg, make it even better
This is all well and good... (Score:2)
Nice, but... (Score:2)
Re:Nice, but... (Score:4, Funny)
If you don't have at least 10 cores, how can you expect to run the ads, tracking software and gratuitous animations required to fully participate in the online society of the late 20-teens?
Intel often communicates poorly. (Score:2)
Broadwell was a "Tick". Skylake is the improvement called "Tock".
Re: Intel often communicates poorly. (Score:4, Interesting)
"Intel has made a habit of launching enthusiast versions of previous generations processors after it releases it a new architecture."
What does that say about Intel? (Score:2)
Re: (Score:2)
Yes, but why?
The enthusiast desktop CPUs are a spin-off from the Xeon server CPUs. They spend much longer time validating those than mainstream laptop/desktop CPUs, so on any new architecture/process they're likely to arrive last. The upside and/or downside is that it might still introduce new features or standards like say DDR4 ahead of the consumer CPUs, but sometimes at a high cost.
Interesting. (Score:2)
Re: (Score:2)
As the line ages, yields improve and they generally iterate over the design in smaller ways to obtain even better efficiency or iron out issues. It's at that point that it becomes ve
Typical Intel confusion (Score:2)
"(so Broadwell-E is 6000 like Skylake processors)"
That, to me, seems like Intel being typically Intel. That creates confusion, instead of communicating clearly.
A long time ago, I wanted to order some Intel motherboards. I needed the part numbers. It required 2 hours to get the numbers.
Several years ago, I mentioned an error in the Intel web site to an Intel customer service employee. He said, "Oh, we are re-doing our web site." A year later, I happened to get the same
VMWare whitebox heaven (Score:3)
This will be nice to pop into a whitebox VMWare ESXi machine. Definitely cheaper than a 2 x 6 core build.
Re: (Score:2)
If only they would pair it with a desktop board that could take 256 GB RAM.
I find that I eat all my disk i/o and RAM way before my cpu.
Re: (Score:2)
Yeah that's part of the problem, but for some of our dev workloads we only use 2GB of RAM per VM but hammer the processor so this is a good niche fit. And Gigabyte's got some workstation boards that go to 64GB but also cost more so it's a trade off - and it's not a sure thing they'll support these chips. Obviously it's not for everyone.
Re: (Score:2)
There are desktop boards that support a theoretical 512GB or 768GB memory, if you go with registered ddr4.
Look for the "pro" chipset, C612.
Needs a Xeon E5-1xxx - the leading 1 says it works only in single CPU mode - which is about the same as an i7 anyway.
Speed fallacies. (Score:2)
"it could potentially trail behind the Core i7-6700K, a quad-core Skylake processor clocked at 3.4GHz (base) to 4GHz (Turbo)."
Not by much.
If you want to see the true speed of any CPU, look at the memory speed. Internal multipliers make some steps run faster, but the overall effect isn't high enough to justify the cost deltas on the higher-clockrate CPUs. In general the sweet spot is 2-4 steps below the top step.
If you have a proper multitasking operating system it will take as much advantage of extra processo
Gillette (Score:2)
I hear it is used to power the new Gillette razor with 6 blades...
Re: 20 cores DOES matter (Score:2)
How is your SSD speed and thermal envelope doing with make -j20?
Re: (Score:1)
Good question - it might end up clamped by other things, but at least up to -j 8, it scales decently well on a current CPU. It's still a big win over -j 4, say.
Compilation has a lot of non-local data access that pays a heavy price for cache misses, so it's not so much bandwidth-sensitive as latency-sensitive. During those cache-miss latency periods, other cores can still be doing something.
It might be that 20 is too much to hope for, I don't know. I'm pretty sure it'll scale beyond 8 though.
Re: 20 cores DOES matter (Score:5, Interesting)
Actually, parallel builds barely touch the storage subsystem. Everything is basically cached in ram and writes to files wind up being aggregated into relatively small bursts. So the drives are generally almost entirely idle the whole time.
It's almost a pure-cpu exercise and also does a pretty good job testing concurrency within the kernel due to the fork/exec/run/exit load (particularly for Makefile-based builds which use /bin/sh a lot). I've seen fork/exec rates in excess of 5000 forks/sec during poudriere runs, for example.
-Matt
Re: 20 cores DOES matter (Score:5, Informative)
Urm. And you've investigated this and found that your drive is pegged because... of what? Or you haven't investigated this and you have no idea why your drive is pegged. I'll take a guess... you are running out of memory and the disk activity you see is heavy paging.
Let me rephrase... we do bulk builds with poudriere of 20,000 applications. It takes a bit less than two days. We set the parallelism to roughly 2x the number of CPU threads available. There are usually several hundred processes active in various states at any given moment. The CPU load is pegged. Disk activity is zero for most of the time.
If I do something less strenuous, like a buildworld or buildkernel, the result is almost the same: CPU is mostly pegged, disk activity is zero for the roughly 30 minutes the buildworld takes. However, smaller builds such as a buildworld or buildkernel, or a Linux kernel build, regardless of the -j concurrency you specify, will certainly have bottlenecks in the build system that have nothing to do with the CPU. A little work on the Makefiles will solve that problem. In our case there are always two or three ridiculously huge source files in the GCC build that make has to wait for before it can proceed with the link pass. Similarly, a kernel build has a make depend step at the beginning which is not parallelized, and a final link at the end which cannot be parallelized, and those actually take most of the time. Compiling the sources in the middle finishes in a flash.
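Those non-parallelizable link and depend steps are Amdahl's law in action; with made-up numbers, a quick sketch shows why adding cores stops helping such builds:

```java
public class Amdahl {
    // Amdahl's law: speedup = 1 / (serial + (1 - serial) / cores),
    // where "serial" is the fraction of the job that cannot be parallelized.
    static double speedup(double serialFraction, int cores) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / cores);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: if the final link pass is 10% of a build,
        // 20 cores give well under 20x, and no core count beats 10x.
        System.out.println(speedup(0.10, 20));   // ~6.9
        System.out.println(speedup(0.10, 1000)); // ~9.9, approaching the 10x ceiling
    }
}
```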
But your problem sounds a bit different... kinda sounds like you are running yourself out of memory. Parallel builds can run machines out of memory if the dev specifies more concurrency than his memory can handle. For example, when building packages there are many C++ source files which #include the kitchen sink and wind up with process run sizes north of 1GB. If someone only has 8GB of ram and tries a -j 8 build under those circumstances, that person will run out of memory and start to page heavily.
So it's a good idea to look at the footprint of the individual processes you are trying to parallelize, too.
Memory is cheap these days. Buy more. Even those tiny little BRIX one can get these days can hold 32GB of RAM. For a decent concurrent build on a decent CPU you want 8GB minimum; 16GB or more is better.
-Matt
Re: (Score:2)
Let me rephrase... we do bulk builds with poudriere of 20,000 applications. It takes a bit less than two days. We set the parallelism to roughly 2x the number of CPU threads available. There are usually several hundred processes active in various states at any given moment. The CPU load is pegged. Disk activity is zero for most of the time.
Have you ever considered using sar to check the amount of minor page faulting going on? It would be interesting to measure the activity between L3 cache and memory; it's possible that memory is thrashing as the CPU scheduler attempts to divide time between logical cores that are actually the same physical core.
My applications are messaging systems, so they aren't transient processes like a compile. Over 20,000 applications that is a lot of context switching, and from what you have described the amount of la
Re: (Score:2)
If we're talking about bulk builds, for any language, there is going to be a huge amount of locality of reference that matches well against caches. shared text RO, lots of shared files RO, stack use is localized (RW), process data is relatively localized (RW), and file writeouts are independent. Plus any decent scheduler will recognize the batch-like nature of the compile jobs and use relatively large switch ticks. For a bulk build the scheduler doesn't have to be very smart, it just needs to avoid movin
Re: (Score:2)
"Memory bandwidth can become an issue"
Bandwidth is almost never an issue. _Latency_ is another matter.
There are a lot of tricks and bits to optimise things regarding locality (mainly around row-based and lookahead accessing; CPUs aren't the only devices trying to predict what will be read next) and controller optimisation, but the underlying dynamic RAM itself hasn't actually improved much over the last 20 years in terms of the time between addressing a random cell and getting an answer back from it. The big
Re: (Score:2)
"Memory bandwidth can become an issue"
Bandwidth is almost never an issue. _Latency_ is another matter.
Thanks stoatwblr, that's exactly what I was talking about.
There are a lot of tricks and bits to optimise things regarding locality (mainly around row-based and lookahead accessing; CPUs aren't the only devices trying to predict what will be read next) and controller optimisation, but the underlying dynamic RAM itself hasn't actually improved much over the last 20 years in terms of the time between addressing a random cell and getting an answer back from it.
Which is why I'm always trying to tune the amount of application latency so the CPU cycles can be used for actual work. Obviously everyone's workloads are different but it makes sense to provide a bit of a helping hand to the machine (usually so I can go home)
The big improvements have been around the number of requests you can make while waiting for that answer instead of being in request-answer lockstep and there is only so far that can be taken.
That's an interesting development I hadn't heard of. That would have a major impact on reducing application latency. I think I'll have to get my head around how it will affect CPU scheduler beh
Re: (Score:2)
Even those tiny little BRIX one can get these days can hold 32G of ram.
Do you have a link for what you are talking about? My desktop has 32 GB at home, and it has a pretty shitty MB, but I am not sure what you mean by a BRIX.
I found this on Google:
http://www.gigabyte.us/product... [gigabyte.us]
But as far as I see, they only support 2 SoDIMM (DDR3L), which many of the specs pages list as 2x8GB max, so I don't know if this is what you are talking about.
Re: (Score:2)
Replying to undo incorrect moderation. Sorry!
Re: (Score:2)
Re: (Score:2)
Where do you buy your salt water fish? I have a taste for some sea bass, broiled with olives and capers.
Something fishy about this (Score:2)
I generally buy them from liveaquaria.com. For my aquariums, not for my dinner. :)
Re: (Score:2)
And they show up alive? I'm always a little amazed when people get live animals shipped to them. I have a friend who's an urban beekeeper who gets live bees fed-exed to them. What a country.
Re: (Score:2)
You can get day-old chicks thru the post office [usps.gov]. Your friend's bees are listed at the top of that same page.
Re: (Score:2)
Yes, they show up alive and doing quite well. They pack them in plastic bags, some of which have black light-shields for the species that are prone to shock from sudden changes in light intensity, all inside said thermal envelope (a Styrofoam cooler, essentially.) They put a heating or cooling chemical packet in there with them, depending on the season, and then ship them overnight by FedEx or UPS. I unpack them immediately upon receipt, gradually acclimate them to the water and temperature they'll be livin
Re: (Score:2)
Very cool.
Re: (Score:2)
Is the CPU cooler able to keep up with the CPU doing that much work, or is the CPU forced to throttle back to prevent overheating?
Re: (Score:2)
You can pick up hex core boxes dirt cheap on eBay now. Look for the X5650 or W3680 Xeon boxes.
Re: (Score:1)
Both of which have such slow single threaded performance that most people are better off with an i3.
I have a W3680 box that runs 8 VDI boxes through Citrix XenDesktop and the performance is pretty good for that purpose, but for gaming it's 25-30% slower than the dual-core i3-4160 in my cheapy shit gaming box
Re: (Score:2)
Interesting link. My Lenovo box has an X58 motherboard, but I don't think the stock BIOS will allow for much tweaking. This was still a huge speed boost from my old Q6600 box.
Re: (Score:1)
You think your CPU isn't good enough? I'm still using a Core 2 Duo here!
Re: (Score:2)
Video compression. I have tested faster speed CPUs with fewer cores against slower speed with more cores - more cores won. I'm VERY interested in this but suspect the 8 core overclocked might be the fiscally responsible way to go. I'd use a XEON but you cannot overclock them...
I do have an ESX server but I've never been able to find a good compression appliance to use. Something I could use with say a web front-end to upload and just settings for ffmpeg would rock. Anyone?
Re: (Score:2)
I would love a video conversion appliance. I want to pull movies off the Tivo and autoconvert them to MP4, as well as any movies in my collection in random video formats do the same. That would be very nice to have something that I could throw on the ESX box to do all that work with some minimal configuration.
Re: (Score:2)
Re:20 cores DOES matter (Score:5, Interesting)
I was under the (admittedly vague) impression that was true only if the thread was using floating point.
CPUs that offer more cores and/or threads than they do FPUs is one of the reasons I write a lot of my multi-threaded stuff (image and baseband RF processing) utilizing appropriately scaled integer math.
I have 8 cores with 8 FPUs on my desk, but many of my users are stuck with some of the wheezier i5 variants.
Re: (Score:1)
As a side benefit, "appropriately scaled integer math" can be better in other ways too, depending on the nature of your problem. It doesn't pile all its precision up right near zero, but gives you an even distribution of the precision across the range. That might (or might not) be a much better fit for your problem, depending on what you're doing.
Re: (Score:3)
Yep. Although I have to say, I really enjoy working with the domain 0.0-1.0; there are so many neat tricks that can be pulled. You can do them in integer too, but there is hoop-jumping involved.
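A minimal sketch of what "appropriately scaled integer math" can look like: Q16 fixed point over [0, 1] (the format choice is illustrative, not what the poster necessarily uses).

```java
public class FixedPoint {
    // Q16 fixed point: a value x in [0, 1] is stored as round(x * 65536).
    static final int ONE = 1 << 16;

    // Multiply two Q16 values; widen to long so the 32-bit intermediate
    // product cannot overflow before the shift renormalizes it.
    static int mul(int a, int b) {
        return (int) (((long) a * b) >> 16);
    }

    static double toDouble(int q) { return q / 65536.0; }

    public static void main(String[] args) {
        int half = ONE / 2, quarter = ONE / 4;
        System.out.println(toDouble(mul(half, half)));    // 0.25
        System.out.println(toDouble(mul(half, quarter))); // 0.125
    }
}
```

Unlike floating point, the precision here is spread evenly across the range rather than piled up near zero, which matches the sibling comment's point.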
Re: (Score:2)
I was under the (admittedly vague) impression that was true only if the thread was using floating point.
No, that is the AMD Bulldozer design. Hyper-Threading provides no extra CPU power for the additional threads; how much performance you get out of the extra threads depends on how much the threads are forced to stall on memory access. If they all execute integer/FP instructions and memory accesses that hit the cache, you get only 50% of normal performance out of each hyper-thread.
Re: (Score:2)
Re: (Score:3)
Apparently the last time you checked was in 2004. Hyper threading gets you a lot more than a 10% gain on modern CPUs.
Re:20 cores DOES matter (Score:4, Interesting)
It depends on the task. For double-precision FP calculations using MPI multi-processing (e.g. FORTRAN CFD), the extra overhead of the extra cores talking to each other mostly cancels out the gains.
For many many small short-lifetime processes you'll probably do better.
Re: (Score:2)
Use MPI BETWEEN nodes and OpenMP WITHIN a node. MPI is always going to be slower on a single node than threads are and for HPC type loads it can easily be 100x slower to use processes instead of threads due to communications overhead.
Re:20 cores DOES matter (Score:4, Informative)
You likely have not checked for a while. I saw figures of 120% performance ("each core at 60% performance" as you put it) back under the Pentium 4 HT, 140% under Nehalem/Sandy Bridge, and 150% under Haswell.
Re: (Score:2)
It is a complicated subject. Some tasks do not benefit from HT - those whose memory access fits entirely within cache, and who make use of operations that cannot be spread among execution units in a core (or where the pattern of operations is superscalar with a single thread).
Simultaneous multithreading (the non-trademark name for HT) offers benefits in certain situations. First, where the memory access pattern is unpredictable and/or uncachable - it essentially lets one thread keep the core working while t
Re:20 cores DOES matter (Score:5, Informative)
Hyperthreading on Intel gives about a +30 to +50% performance improvement. So each core winds up at about 1.3 to 1.5 times the performance with two threads versus 1.0 with one. Quite significant. It depends on the type of load, of course.
The main reason for the improvement is of course due to one thread being able to make good use of execution units while the other thread is stalled on something (like memory or TLB, significant integer shifts, or dependent Integer or FPU multiply and divide operations).
-Matt
Re: (Score:2)
Re: (Score:2)
Workload and system balance, mostly.
If you look back several years (2008? earlier?) you'll see some Sun Sparc designs, and some IBM POWER designs, that supported 4 or 8 threads per core. They worked well for very specific workloads and applications.
The Sun Sparc designs with 8 threads per core were mostly tailored for "simple" highly-scalable web servers,
Re: (Score:2)
The down side of SMT is that you incre
Re: (Score:2)
Memory isn't neccesarily the bottleneck. (Score:2)
It's not that straight-forward. Depends on how much time the threads spend in the cache, and how much time they spend waiting on the FPU.
Re: (Score:2)
Re: Sounds like fun (Score:1)
Thank you for sharing. I was lost without your insightful comment!
Re: (Score:2)
Don't worry. Windows will bloat to fill the additional cores.
Re: (Score:2)
They did in 1990. Oh wait, that subject was run by an electrical engineering department so it may take a while for others to catch up.
Re: (Score:2)
you would think that they do, since it's essential to making anything in Java or Android or just about anything else except JavaScript nowadays.
anyhow.. our multi-thread parallel programming course was in Ada.. it was in Ada so that we could use the language's built-in mechanisms and skip learning anything. I guess there was some rationale, but it was probably the same rationale that was used for teaching microcontroller programming with assembly on some Hitachi POS.
Re: (Score:2)
"a whopping 25MB of L3 cache" souds like "a whopping 20Mb capacity" for a brand new hard disk in the early 1990's...
*Whooosh*