Roughly seven years ago, Partha Ranganathan realized Moore’s law was dead. That was a pretty big problem for the Google engineering vice president: He had come to expect chip performance to double every 18 months without cost increases and had helped organize purchasing plans for the tens of billions of dollars Google spends on computing infrastructure each year around that idea.
But now Ranganathan was getting a chip twice as good every four years, and it looked like that gap was going to stretch out even further in the not-too-distant future.
So he and Google decided to do something about it. The company had already committed hundreds of millions of dollars to design its own custom chips for AI, called tensor processing units, or TPUs. Google has now launched more than four generations of the TPU, and the technology has given the company’s AI efforts a leg up over its rivals.
Google uses all kinds of custom hardware throughout its operations, but you rarely hear about it. This article provides some insight into the custom hardware Google uses for YouTube transcoding.
There is a very common misconception about Moore’s Law:
https://en.wikipedia.org/wiki/Moore%27s_law#/media/File:Moore's_Law_Transistor_Count_1970-2020.png
It is not about the speed, but rather about the number of transistors on chips. And those keep doubling.
The side effect was technology getting cheaper at the same time. If you can fit 2x transistors in the same chip, you can sell the older chip for less, or make a much better chip at the same price.
The problem is, new chip printing processes are much more expensive, hence there are no more cost savings passed to consumers.
Threadripper, for example, now asks $4,000+, while the originals cost less than a quarter of that price. If they had actually followed the side effect of Moore’s Law (more computing for the same price), we would all be visiting OSNews on 64-core machines right now.
sukru,
When thinking of that many cores, I always question the utility for normal consumers. (Relatively) few people need 16 cores, much less 64. In the past having more transistors paved the way for more registers, larger registers, longer pipelines, etc, which brought general purpose benefits. But there’s diminishing return for all of these things. More often than not transistors are used to implement extra cores and special purpose application accelerators that will spend most of the time going unused.
While I think it’s awesome to have more parallelism in hardware, most general purpose software is not well optimized for 64 cores on a CPU. Very frequently when I complain about software performance bottlenecks in everyday desktop software (take the gimp, libreoffice, etc), I look at the CPU utilization only to see that one core is spiking at 100% while the others are almost idle. Doh. Games may be getting a little better about this, but many still don’t scale with CPU cores because either they’re GPU bound or they use a highly single-threaded game loop.
So until we solve the software parallelism problem, the benefit of massively parallel hardware will remain modest for the masses.
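To put a rough number on that parallelism problem, here is a minimal Amdahl’s-law sketch (Python, with made-up parallel fractions) showing why mostly serial code barely notices extra cores:

```python
# Amdahl's law: if only a fraction p of a program's work can run in parallel,
# the best-case speedup on n cores is 1 / ((1 - p) + p / n).
def speedup(p, cores):
    return 1.0 / ((1.0 - p) + p / cores)

for p in (0.5, 0.9, 0.99):
    print(f"p={p}: 16 cores -> {speedup(p, 16):.1f}x, "
          f"64 cores -> {speedup(p, 64):.1f}x")
# With p=0.5 the speedup tops out below 2x no matter how many cores you add.
```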
Alfman,
You are right about scalability of parallelism, especially for general purpose application code. Only specific applications will benefit from 64 AMD64 cores (video processing, compiling, simulations, …)
However we can now have different kind of cores. GPUs, Tensors, DSP, FPGA, and whatnot.
But then, of course, there will be a period when it will not be highly utilized. APIs and standards will take years to reach maturity.
A video conference application would benefit a lot from Tensor cores. But is there a Web API for Tensor cores? (… checking …) Turns out it is not there yet, but there are early extensions: https://www.secondstate.io/articles/wasi-tensorflow/
Going back, I would also be okay with the other benefit: same performance at half the cost. But that is not happening either.
The bad news for us is that even single-threaded implementations are not always optimized (and I have no idea what percentage of them are). Case in point: I looked up one popular algorithm to convert geodesic coordinates to UTM and back. To my surprise, sin, cos, sinh and cosh were calculated four times more often than needed in popular projects. Now, take into account that mappings can contain millions of points, and voilà!
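As a hypothetical sketch of the pattern (the formulas below are placeholders, not the real projection math), the fix is simply to evaluate each trig term once per point and reuse it:

```python
import math

E2 = 0.0066943799901  # WGS84 eccentricity squared, used here only as an example constant

def convert_naive(lat_rad, lon_rad):
    # Placeholder formulas: sin(lat) and cos(lat) are re-evaluated in every
    # sub-expression, several calls where one of each would do.
    x = math.cos(lat_rad) * lon_rad / math.sqrt(1 - E2 * math.sin(lat_rad) ** 2)
    y = math.sin(lat_rad) * math.cos(lat_rad) / math.sqrt(1 - E2 * math.sin(lat_rad) ** 2)
    return x, y

def convert_hoisted(lat_rad, lon_rad):
    # Same placeholder formulas, but each trig term is computed exactly once.
    s, c = math.sin(lat_rad), math.cos(lat_rad)
    nu = 1.0 / math.sqrt(1 - E2 * s * s)
    return c * lon_rad * nu, s * c * nu
```

Over millions of points, hoisting those calls is exactly the kind of free win that gets left on the table.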
acobar,
I’ve grown up and accepted this reality too. Optimizing algorithms used to make the impossible possible, like with early 3D games. Nowadays most employers don’t want you wasting time on it. Software inefficiency makes me cringe sometimes, but when it’s not my call I just have to move on regardless of how I personally feel about the quality of code getting shipped.
Moore’s law is a “law” of economics. Customers care about many different things: transistor density in memories, performance in CPUs, power efficiency in mobile SoCs. It just happens that several decades ago they all strongly correlated with each other.
ML is an observation that, when it comes to the IC market, a fixed (additive) expenditure produces a multiplicative improvement in product attractiveness. Close the loop and you get exponential growth.
Depending on which spec is important to you, ML ended (or dramatically slowed down) somewhere between 20 and 5 years ago. While that’s bad for the broader market, it has fuelled quite a lot of investment and innovation in areas other than IC manufacturing.
ndrw,
I would normally argue Moore’s Law still holds in the datacenter. But clearly, for the case of YouTube it is struggling … Yet, they still managed to keep up with custom silicon.
(In the case of YouTube, I think their needs are expanding more than any other platform’s. Not only are there more content creators and more hours of video uploaded thanks to always-available mobile devices, the video fidelity itself has skyrocketed. 240p is no longer the “web standard”; we now expect 4K UHD from almost all videos.)
After reading the summary, I was quite certain posters would be arguing the point about Moore’s Law, haha. It depends on who you ask… Some say it’s a measure of computing performance, others say it’s about transistors. Either way, some say it’s already dead, others say it isn’t.
https://www.cnet.com/tech/computing/moores-law-is-dead-nvidias-ceo-jensen-huang-says-at-ces-2019/
https://www.techrepublic.com/article/moores-law-dead-in-2021-heres-what-the-next-revolution-will-mean/
Interestingly, the TechRepublic link says Moore’s law is dead despite using sukru’s definition. Furthermore, they claim consumers will still benefit from falling prices, unlike sukru… So I don’t think there’s going to be any real consensus on this one 🙂
Personally I didn’t think the growth in density could last forever.
It can’t. Physics will kick in sooner or later. Regular electrical circuits, for now, rely on the predictable results given by the large quantities of charged particles involved, even when the effect has a quantum root, like diodes (OK, actually, all of them are quantum effects applied to large quantities; aren’t statistical effects wonderful?). We can’t keep halving and expect quantum uncertainties not to show up; they will.
Perhaps, and this is what some suspect, we may have hybrid solutions in the not-so-distant future, as some kinds of algorithms would benefit from exploiting quantum behaviors.
Now, back to my cave, after repeating obvious and largely known things.
Quantum uncertainties/behavior have been at the core of semiconductor physics. The operation of the first transistor was literally based on it.
Every decade heralds the same assumption regarding how the scaling/manufacturability of semiconductors has hit an insurmountable wall. And there is always some subjective physics limit as a justification.
In the 60s there was an assumption that lithographic processes were limited to the optics/controls of the time. In the 70s there was an assumption that CMOS was a dead end and bipolar had to take over, which meant much reduced density and increased power was the only way to achieve operational increases. In the 80s there was an assumption that there was a frequency wall for CMOS around 66ish MHz. In the early 90s there was an assumption that GHz operation was only possible with “exotic” processes involving GaAs. In the late 90s there was an assumption that electromigration issues would make anything under 130nm impractical, as that was the beginning of the limits for planar transistors. In the early 00s there was an assumption that runaway leakage would make things under 80nm impractical. In the late 00s we assumed that EUV processes would be needed to resolve anything under 20nm, and that they would be so expensive that they would make transistors exponentially more expensive, not cheaper. Etc, etc, etc, etc.
javiercero1,
There does seem to be growing industry consensus, though, that the physics barriers are finally upon us, with little further scaling to be gained. The engineering focus is shifting to making processors that are physically bigger by improving yields, rather than improving the node technology itself. It may finally be the end of the line for shrinking transistors.
ML is a law of cost-per-transistor decline, which has held at a somewhat constant rate of a ~1% reduction per week for 5+ decades.
ML hasn’t ended whatsoever. It’s just that people mistake Moore’s Law for the pace of lithography advances, when a big chunk of ML scaling, for example, comes from old processes recouping costs and becoming profitable.
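Taking that ~1% per week figure at face value, a quick back-of-the-envelope check of what it compounds to (just compounding arithmetic, nothing more):

```python
import math

weekly_factor = 0.99                 # cost-per-transistor multiplier per week, as quoted above
yearly_factor = weekly_factor ** 52  # ~0.59, i.e. ~41% cheaper per transistor each year
weeks_to_halve = math.log(0.5) / math.log(weekly_factor)  # ~69 weeks

print(f"cost factor after one year: {yearly_factor:.2f}")
print(f"weeks to halve the cost:    {weeks_to_halve:.0f}")
# Roughly a halving of cost per transistor every ~1.3 years, in the same
# ballpark as the classic 18-24 month formulation.
```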
You are all making me invest in researching this 🙂
Back to Wikipedia:
So, yes, there are alternative formulations of the law, but the original is still: density at minimum cost per transistor.
That is a bit interesting. Both the original paper and the article mention yields. Chip yield depends on many factors, but one important component is size. As you build larger chips, the probability of each being affected by a defect increases. (If there is a 1% chance of a defect in each 1cm x 1cm area, a 4cm x 4cm chip will be 85% defect free, whereas an 8cm x 8cm chip will only be 52% defect free. In other words, on average a 128 cm2 area will give either one good 8×8 chip (out of two candidates) or six to seven good 4×4 chips (out of eight), which means roughly 40% less usable silicon from the larger dies.)
Anyway, by packing more transistors per chip you reduce the average cost.
And by using more area per chip you reduce good yield ratios.
And there is a sweet spot in between.
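To make that sweet spot concrete, here is a toy cost model (all the numbers are assumed: a fixed wafer cost, a small per-die overhead for packaging/test, and the 1% defect per cm² figure from above):

```python
WAFER_COST = 1000.0   # assumed cost of processing one wafer
WAFER_AREA = 128.0    # cm^2, matching the example above
DIE_OVERHEAD = 5.0    # assumed per-die cost for packaging/test/IO

def cost_per_good_cm2(side_cm):
    area = side_cm * side_cm
    die_yield = 0.99 ** area              # 1% defect chance per cm^2
    candidates = WAFER_AREA // area       # candidate dies per wafer
    good_dies = candidates * die_yield    # expected good dies
    total = WAFER_COST + candidates * DIE_OVERHEAD
    return total / (good_dies * area)     # cost per cm^2 of *good* silicon

for side in (1, 2, 4, 8):
    print(f"{side}x{side} cm dies: ${cost_per_good_cm2(side):.2f} per good cm^2")
# Tiny dies waste money on per-die overhead, huge dies waste silicon on
# defects; the minimum sits somewhere in between.
```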
At this point we probably need a cost per transistor graph, which I was unable to find. This was the closest: http://www.singularity.com/charts/page62.html
Moore’s Law was an aggregate trend for a whole industry. So you have to account for all processes active per annum, not just the latest ones to come online.
BTW, there’s a time component to yield too; i.e. the longer a process matures, the higher the yield.
In a single year you have the entire range of components; that whole spectrum averages out into an effective reduction in transistor cost. Exponentially growing design/verification costs keep creeping in, though.
https://www.eejournal.com/wp-content/uploads/2018/10/Fig-3-DARPA-Design-Cost-Plot.jpg
This graph shows transistor density (transistors per square mm).
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0256245
This graph shows transistor count
https://twitter.com/future_timeline/status/1506378798157156355/photo/1
There are significant differences between these two projections in terms of what they mean for the future. If we care about density, then Moore’s law does appear to be dying. But if we don’t care about density, then we can just use bigger dies to add more transistors going forward. If we want 2X more transistors, we can cut dies with twice the area without dramatic improvements in fab technology.
Obviously there are benefits to increasing transistor counts by doubling the density as opposed to the die dimensions. If we are really reaching the end of the line for density improvements, then future chips that double the number of transistors are going to have to do it by physically increasing die dimensions instead. While physically achievable, we won’t get the efficiency improvements, and the larger distances between components have downsides like increased bus latency between them.
Alfman,
We are already seeing much larger chips. Threadripper is one such design, and so are Apple’s M1 Pro and M1 Ultra.
https://youtu.be/9idMb1DRhg0?t=59
However, once again, larger chips mean more failures. To counter this, CPUs have moved to “chiplets” (multiple chips “glued” together, for lack of a better term).
Anyway, looking back:
https://en.wikipedia.org/wiki/Intel_Galileo
https://www.vintagecomputing.com/index.php/archives/1069/retro-scan-of-the-week-the-micron-millennia
Intel’s Galileo came with the Quark chip, which is a modern clone of the original Pentium at a much lower price: $70 vs $2,500 (even without adjusting for inflation).
Bottom line, we seem to see:
– Denser lithography = more transistors in the same area
– Larger chips
– Improved process = better yields
– Increased clock speeds (but we have hit a soft limit)
– Cheaper production of older generation designs
All of which have been delivering more computing for the same price (or the same computing for half the price).
Anyway, this was educational.
sukru,
Yes. I agree with your summary from before. This seems more suited to data centers than consumer applications.
Clearly everyone benefits from improved node technology, but once density and node technology stall, the kind of scaling available going forward appears less desirable for consumers contending with battery life, fan noise, and limited application support for such high core counts. For a datacenter, even without node improvements there can still be benefits to packing dies with more and more cores because it enables them to consolidate servers. In short, so long as power & cooling are viable, there will always be demand for massive CPUs from data centers even without future node improvements.
Datacenter applications are actually more sensitive than consumer applications to die scaling issues.
Space, cooling, and power budgets take up over half of the overall cost of operation. Inter-die scaling (stacking, chiplets, PoP, etc.) does not necessarily improve those cost dimensions as much as intra-die scaling does.
javiercero1,
Well, the thing is the higher space, cooling, and power budgets of larger dies need to be compared against the servers that the data center would otherwise be using. It’s not an additional cost for them; it just distributes the costs across fewer servers, which can be beneficial even absent any node improvements.
I agree there are still improvements to be made on yields, even if they can’t improve node density. Improving yields reduces waste, which should reduce costs. This link puts AMD’s yields at 70% a few years ago; I’d be curious whether it’s changed since.
https://www.pcgamesn.com/amd/zen-2-ryzen-3000-cpu-yield-70-percent
Building in the z-dimension doesn’t cost much space, but I imagine that heat is the limiting factor, since cooling is proportional to surface area.
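A crude illustration of that surface-area worry (the footprint and per-layer power are made-up numbers): stacking layers multiplies the heat that has to leave through the same top surface.

```python
FOOTPRINT_CM2 = 4.0      # assumed die footprint
WATTS_PER_LAYER = 20.0   # assumed power per active layer

for layers in (1, 2, 4, 8):
    flux = layers * WATTS_PER_LAYER / FOOTPRINT_CM2
    print(f"{layers} layer(s): {flux:.0f} W/cm^2 through the top surface")
# The coolable surface barely grows as you stack, so heat flux scales with
# the layer count.
```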
Well I also think going from the MHz wars to the core wars? Kinda caused machines to quickly get so damn beefy Joe and Sally Normal quickly ended up with machines that will do what they want it to do until they die.
I mean I used to be a HARDCORE gear head, I used to be constantly replacing my hardware almost religiously every other year, but here is the thing…my R5 3600 isn’t even being stressed as it is, why would I replace it? The same goes for the wife with her R3 1200 or even the granddaughter and her FX-8320, for the tasks they have like the Wife’s World Of warships or the granddaughter’s Minecraft neither system is breaking a sweat, so why buy new systems?
With the MHz wars it was easy to tell you needed a new system as it was simple to “feel” for lack of a better word single core performance gains…that hasn’t been that way for quite a while now. Heck I even recently rescued an FX-6300 out of the parts bin for a grand kid whose Xbox died and paired with 16GB of RAM, a cheap SSD and an R9 380x I had lying around? He is playing his MMOs at 60FPS plus at 1080P and that is literally a decade old chip.
So even if they were to somehow get the prices back to the first and second gen unless you are doing some SERIOUS workloads I think the new chips would be extreme overkill. I mean I’m sure one day I’ll throw in a 5950x into this rig just because I find one at a decent price, but will I actually NEED that level of performance? Nope it’ll probably spend most of its time twiddling its thumbs.
bassbeast,
+1
Most desktop users aren’t saturating CPUs these days. Even if you’re into gaming, games are typically more taxing for the GPU than CPU. Obviously there’s always going to be some applications for which the performance and cores are appreciated, but all those cores have marginal impact for the average user.
I actually think there would be more value for faster CPU cores (instead of more of them). Many productivity applications remain stubbornly single threaded. Alas, the future of CPUs is in core counts rather than core speeds since we’ve already picked the low hanging fruit and more cores are easier to deliver whether consumers need them or not.
Alfman,
You can go faster, only after you go smaller.
I think this will be a repeat, but the speed of light, or rather the speed of electrical signals, is the limiting factor. At 1 GHz, light can travel only 30 cm per clock cycle; at 5 GHz, an electrical signal covers only around 3 cm.
Moving to 10 GHz, for example, would require much smaller chips and/or faster transfer media. Smaller chips can of course also be achieved with a simpler design (ARM?).
And, that was the reason we cut up chips into multiple cores in the first place.
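For reference, the distance a signal covers in one clock cycle, assuming it propagates at roughly half the speed of light in on-chip/board interconnect (a rule-of-thumb assumption, not a measured figure):

```python
C = 3.0e8                  # speed of light in vacuum, m/s
SIGNAL_SPEED = 0.5 * C     # assumed propagation speed in interconnect

for ghz in (1, 5, 10):
    period = 1.0 / (ghz * 1e9)                  # seconds per clock cycle
    light_cm = C * period * 100
    signal_cm = SIGNAL_SPEED * period * 100
    print(f"{ghz:>2} GHz: light {light_cm:.0f} cm, signal ~{signal_cm:.1f} cm per cycle")
```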
sukru,
Yes, I know. That’s why chipmakers are giving us more cores even though many of us would prefer giving up cores for faster ones.
A simpler ISA design could help overcome frequency limits imposed by propagation delay, but I think we could get even more dramatic results if we were willing to completely throw away CPU designs that depend on centralized registers, data buses, and core-wide synchronization.
We could build CPUs to resemble linear assembly lines with totally independent worker units. Instead of saving results back to a central register file, a result could just be passed forward to the next worker, which is immediately adjacent to the previous one. This design would intentionally avoid core-wide data buses and global registers. All information would be handed off quickly without waiting on global synchronization.
Independent workers provide opportunities to increase clock speeds since there’s no need to wait for signals to propagate across the core outside of a worker’s domain.
This would be a boon for achieving higher frequencies, but it’s a radical departure from what software developers are used to. It would probably feel more like programming an FPGA. I would love to play with something like this, but we still live in an industry that is notoriously resistant to change.
In practice, the lack of native software & tools taking advantage of the design would be a major problem. It would be bad to judge this kind of CPU by how well it emulates existing software, but that’s very likely what would happen, like the Itanium.
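Purely as a toy model of the idea (Python threads and queues standing in for hardware workers; the stages are made up), each worker owns only its local inbox and forwards results to its neighbour, with no shared register file and no global synchronization:

```python
from queue import Queue
from threading import Thread

def worker(op, inbox, outbox):
    # Each worker loops: take a value from its local inbox, apply its one
    # operation, and hand the result to the next worker in line.
    while True:
        value = inbox.get()
        if value is None:        # shutdown token, passed down the line
            outbox.put(None)
            return
        outbox.put(op(value))

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]   # made-up pipeline
queues = [Queue() for _ in range(len(stages) + 1)]
workers = [Thread(target=worker, args=(op, queues[i], queues[i + 1]))
           for i, op in enumerate(stages)]
for w in workers:
    w.start()

for v in (1, 2, 3, None):        # feed a few values, then the shutdown token
    queues[0].put(v)
while (result := queues[-1].get()) is not None:
    print(result)                # ((v + 1) * 2) - 3 for each input
```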
Alfman,
Yes, such an architecture would be interesting. However, making it generic at the same time is going to be a challenge.
The number of stages is fixed, and one would not be able to go backwards easily.
Yet, there were similar designs in early GPUs: They were basically built around a single graphics API (OpenGL / Direct3D) with predefined stages:
https://en.wikipedia.org/wiki/Fixed-function
sukru,
You could have more connected topologies (such as the hypercubes used by some clusters, etc.) that would be able to loop back (and fast-forward) through the network to align with algorithmic requirements. The goal, though, would be to minimize choke points and maximize worker localization. In many ways this is similar to parallel computing cluster models, but I was thinking these workers would be far simpler than a typical x86 cluster node.
Those fixed function pipelines were essentially hard coded from beginning to end. But I liken it more to an FPGA with functional ALU units that are available for use by workers, yet not hard coded. Such pipelines could be programmable.
It’s interesting to think about what software programming languages for these would look like. Technically you could program each worker individually, but IMHO that is an antiquated approach. We would need high-level language compilers that can build & optimize the network used to program the hardware. All of these changes would be very unpopular with the software industry, but I think we can all agree that regardless of what the industry wants, locality is the key to compensating for hardware’s physical limits on frequency and node sizes.