A Founding Father of Modern Computing: Part 2
A revised transcript from Part 2 of our interview with Steve Casselman
Colby: I also wanted to talk about your work at DRC, because that seemed like a big part of what you did early on.
Steve: Yeah. That was another portion of everything. I started DRC with a guy named Larry Laurich, who had been, well, let me back up a little bit.
So I'm in this book written by Dr. Fred Haney, because he was on my board at VCC, and his thesis basically is you've got to get somebody who's really fungible to get money. If you get somebody who's run a hundred startups and made money on every one, then you're gonna get money.
So he went out and got this guy for me, Larry Laurich, who had been the general manager at Tandem Computers. And he had 1,500 people working for him. So together, he and I started up DRC Computer. That was an interesting thing, because I had these patents about putting FPGAs together with CPUs.
And then the Opteron came out, and the Opteron was the first processor that really had a multiprocessor mentality. Every workstation was really set up with two Opterons, right? And they talked over this open bus, HyperTransport.
So when that came out, I immediately knew that this was the perfect thing for me, because I could take one processor out and put in an FPGA, and then I would have a tightly coupled system where the CPU is a world-class CPU and the FPGA is right there next to it, over a high-speed interconnect.
So that was very important. The minute I saw that, I was like, oh my God, this is gonna happen. We could do this. And I got hooked up with Larry and he helped raise our first money. And it was very interesting, because I was in Reseda, which is in the San Fernando Valley, and then I had to move up to Silicon Valley.
So before we were up there, back when I first came up with it down in Southern California, I went to AMD and said, check this out, I could go right into your socket. We could do this. This would be great. And they were like, are you okay?
So then I went up to Silicon Valley, and we actually made one; we made this one fit in the socket. We called in AMD. This guy comes in and he goes, oh, you're not the only smart person. There are other people who have been thinking about this. I go, did they live in Reseda? And he goes, what? And he gets out his PC and he goes, oh, that was you. Yeah, I was the really smart guy in Reseda.
But we made the thing and it actually worked, and I got it to boot myself. I did all the work to get it to boot, and that was really hard. I had never worked so hard in my life. I worked 20-hour days, slept on the floor, did that for months and months.
Colby: And that's crazy, because I was looking, and you were the CTO there. Usually the CTO kind of gets to take a break from some of the technical stuff, but you just kept working on the technology.
Steve: Yeah, we had several PhDs working for us, but there would come a point when they couldn't do it, they couldn't figure it out. And so then it was my job to do it. And I did. And there were other things, like having to reboot in the Opteron socket, where these guys said, oh, it can't be done.
But I figured it out, and I got a patent on it. Everybody's going, oh, we can't reboot in the socket, and I go, it's a reconfigurable computer, you have to. You don't understand, there are no options here, right? And it's hard to convey that kind of urgency to somebody, to say, imagine you're on fire, you have that long to make this thing work.
So I got it to work, and that was really great. And then the CEO went to Cray and said, check this out. And they came down, and it was a big hit with them, because basically you didn't have to change your design to put it in an FPGA.
You just put this thing in the processor socket, right? So we sold some stuff with them. Still, all of this is so hard. It's getting easier because of some of this open source stuff and a lot of things going on, but I really think there's a way to kick FPGAs up to processor-style ease of use.
And that's all doable, and some of it's been done before, so eventually, whatever I'm doing, I'm going to head toward that point. It's gotta be so that whoever's programming just programs. They don't care. There's only one thing you have to think about: pipelining stuff, and whether your code would be pipeline-able.
And it turns out that's actually good for regular computers too. Somebody gave a paper at a supercomputing conference where they go, oh, we decided we would pipeline these algorithms and do them in little batches, and we got a 40% increase.
So that has always been my 20-20 vision: if you could get some language where somebody could, on a regular computer, get 20% more, and then, if you went to the FPGA, get 20 times more, then nobody would argue with that ever, right?
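The batching idea from that conference paper can be sketched in plain Python. This is a hypothetical illustration (the `two_passes` and `pipelined_batches` names and the toy stages are invented, not the paper's code): applying every stage to one small batch while it is still hot in cache, instead of sweeping the whole array once per stage, is what makes "pipeline-able" code faster even on a regular computer.

```python
def two_passes(data):
    """Stage-at-a-time: each pass re-reads the whole array from memory."""
    stage1 = [x * 3 for x in data]      # pass 1 over everything
    return [x + 1 for x in stage1]      # pass 2 over everything

def pipelined_batches(data, batch=1024):
    """Pipelined: both stages run on each small batch while it's in cache."""
    out = []
    for i in range(0, len(data), batch):
        chunk = [x * 3 for x in data[i:i + batch]]  # stage 1 on the batch
        out.extend(x + 1 for x in chunk)            # stage 2, batch still hot
    return out
```

Both versions compute the same result; the payoff of the batched form is purely in memory traffic, which is also why the same structure maps naturally onto an FPGA pipeline.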
Colby: Obviously, right now you're doing an awesome job explaining this technology that most people don't understand. Was that a difficult part, trying to actually sell it? Because you know it works and you know how it works, but you have to get those ideas out.
Steve: Yeah, I'm not the best sales guy, because I look at something, I have a vision of how it works, I understand how it works, and it's at a gut level. I could tell you how it works, but it might take a couple of years to get you there. And that means I'm not all that good at this, so that's one of the things I'm trying to get better at: being able to explain what my vision is and how it comes across.
But the problem now is it's still many years ahead, and venture capital guys don't understand this. And I'm not all that fungible, because I was never in a big company leading a whole team and doing all that kind of stuff.
Colby: How did DRC come to a close? Because I saw a couple of articles and everything. And you mentioned the algorithm and everything like that?
Steve: That was one of my goals as the CTO there, to show that we could do something better than anybody else in the world. Something, just something. So that ended up being the Smith-Waterman algorithm, which is used to find alignments in DNA sequences.
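For reference, the Smith-Waterman recurrence he's describing fits in a few lines of Python. This is a minimal sketch of the standard scoring matrix (the match/mismatch/gap values are illustrative defaults, and the traceback that recovers the actual alignment is omitted); an FPGA version parallelizes exactly this cell-by-cell update.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score between strings a and b
    via the Smith-Waterman dynamic-programming recurrence."""
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # scores are clamped at 0, which is what makes the alignment local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Each cell depends only on its left, upper, and diagonal neighbors, which is why the matrix maps so well onto a systolic array of FPGA cells.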
So that project started with a really smart guy working for me, who eventually got his PhD at Stanford. He would get done with something and be playing around, and I'd go, hey, whenever you're in that position, just look at this stuff and do this stuff. And he goes, okay.
So eventually we got all that to work, and it was pretty awesome. And it was just one of those things that I had to do behind the scenes. There was also stuff that I pushed people to do: when you're not doing something, look at this, you'd like it, it's interesting. And that got a lot of things going.
Colby: I like to ask ASIC and FPGA designers how it was learning everything, because when you used to have to draw the circuits, I can't even imagine doing that or teaching it. What were curriculums like? Did you learn most of your stuff at college as opposed to at work?
Steve: I got a degree in math. So I went to UCLA, and I was gonna be an EE, I was gonna be a computer science and math double major. So I went to the computer science stuff, and they had this weird language that's dead now. And they wanted you to do that, and I'm going, I'm not doing that.
And I figured, what I will do is I'll get my degree in math, and then when I get a job, that'll come in handy. And if I sit in front of a computer all day, I'll just learn about the computer. Whatever it is, I'll learn that system, and it'll be something I can do. And so I sat down, and the first system was really the Daisy Systems one.
So I learned all about CAE. They had a Pascal behavioral language I learned, and I learned SPICE, and I learned to do all sorts of stuff on my own in the meantime, while I was putting in other people's schematics. So for me, that's how I got into computer science, through mathematics, which has helped a lot, because in reconfigurable computing, what happens is that people have very hard problems to solve.
And usually these problems involve math. And so if you can understand the math, you can pretty much understand the problem, and if you're good, you can understand how that maps into hardware, right? So that turned out well, and I didn't regret it.
As for getting a computer degree back then, in the early eighties, there just weren't that many computers, and they weren't that much fun. So I was in the CAE lab when the Mandelbrot set came out in Scientific American. We had just gotten one PC that had a color monitor, and it was a really slow IBM. So I would write programs to do the Mandelbrot and then I would leave. I was working the swing shift, so I had a lot of time. I'd put the Mandelbrot on there and leave, and then come back, and everybody would go, oh, that was really cool. But we had to reboot it for the machine to work, so I never got to see many of the ones I generated. It was fun because I would do every 10th dot or every hundredth dot.
And if I had a lot of colors, then I'd shoot it off and it would take all night. I wouldn't get to see it, but it was fun. So that's why I did the Mandelbrot on the FPGA. I'm the first guy to do that. The picture that you see in the background over there, that's all done on an FPGA.
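The escape-time loop behind those all-night renders is tiny. Here is a minimal Python sketch (the resolution, iteration limit, and ASCII plot are illustrative; his FPGA version would pipeline the inner multiply-add loop in hardware):

```python
def escape_iter(c, max_iter=50):
    """Iterations until |z| exceeds 2 under z -> z*z + c.
    Returns max_iter if the point never escapes (i.e. is in the set)."""
    z = 0j
    for n in range(1, max_iter + 1):
        z = z * z + c
        if abs(z) > 2:
            return n
    return max_iter

def ascii_mandelbrot(width=60, height=24, max_iter=50):
    """Plot the set over roughly [-2, 1] x [-1.2, 1.2], '*' for in-set points."""
    lines = []
    for row in range(height):
        im = 1.2 - 2.4 * row / (height - 1)
        line = ""
        for col in range(width):
            re = -2.0 + 3.0 * col / (width - 1)
            line += "*" if escape_iter(complex(re, im), max_iter) == max_iter else " "
        lines.append(line)
    return "\n".join(lines)
```

The per-pixel work is just repeated complex multiply-adds with no data dependence between pixels, which is why it was such a natural first demo for a hardware accelerator.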
It was so funny, because at a demo night, what they call demo night, I was showing off the Mandelbrot, and some guy who was like six-six or something comes up, and he's looking down at me and goes, what do you got? And I go, I got 24-bit multipliers.
And they were very small multipliers, because I used Booth's algorithm, but it was running faster than the Sun workstation. So he goes, okay. And he comes back with his friend, who was like five-six, and there's these two weird guys, and the big guy's hitting the little guy saying, see, I told you they could do it. See, I told you! And I got some email from a Bob Smith. I replied once; when I tried to reply again, it was gone. They turned out to be NSA guys. Very interesting.
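Booth's algorithm is what let those multipliers stay small: it recodes the multiplier bits so a run of ones costs one subtraction and one addition instead of an add per bit. A minimal radix-2 sketch in Python (register names and widths are illustrative; a hardware version is the same add/subtract-and-shift loop in logic):

```python
def booth_multiply(m, r, bits=8):
    """Radix-2 Booth multiplication of two signed `bits`-wide integers."""
    mask = (1 << bits) - 1
    A, Q, q_1 = 0, r & mask, 0           # accumulator, multiplier, extra bit
    M = m & mask
    for _ in range(bits):
        if (Q & 1, q_1) == (1, 0):
            A = (A - M) & mask           # run of ones begins: subtract
        elif (Q & 1, q_1) == (0, 1):
            A = (A + M) & mask           # run of ones ends: add
        # arithmetic right shift of the combined (A, Q, q_1) register
        sign = (A >> (bits - 1)) & 1
        combined = ((A << (bits + 1)) | (Q << 1) | q_1) >> 1
        if sign:
            combined |= 1 << (2 * bits)  # sign-extend the shifted-in bit
        A = (combined >> (bits + 1)) & mask
        Q = (combined >> 1) & mask
        q_1 = combined & 1
    result = (A << bits) | Q             # 2*bits-wide two's-complement product
    return result - (1 << 2 * bits) if result >> (2 * bits - 1) else result
```

In an FPGA, the add/subtract decisions become a little control logic around one adder, which is how a 24-bit multiplier could be made "very small."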
So in the very early times, I was trying to get things going, so I called up the founder of Xilinx, Ross Freeman. And I said, here's what I'm doing, and I wanna work for you. He didn't have a position for me, but he said, you have to talk to these guys in the government and tell them what you're doing. And I go, okay.
So it was the Institute for Defense Analyses. I called them up, talked to a guy, and he goes, can I see your proposal? I go, can I get an NDA? He goes, no, whatever comes in here doesn't leave. And I go, okay. So I sent it in. And then, of course, later on that same group came out with one of the first reconfigurable computers.
So they went from their Splash 1, which was loaded serially, and they were trying to make an ASIC so they could search everything on the internet. But then I gave them my paper, and the next thing, they came out with the Splash 2, which was a real reconfigurable computer.
And their demonstration was actually Smith-Waterman, because in Smith-Waterman you're looking for a string in DNA, but if you make it a little bit bigger, you can look for words in streaming paragraphs, and it could stream as fast as the internet could go at that time.
Colby: And the last thing I wanted to touch on was your current project, the one we first connected on. Is that the smart NIC one?
Steve: What I'm working on right now, my business plan, is a smart NIC plus computational storage. My thesis is basically that computers have evolved silos of optimization. The CPU is optimized. The network card is optimized. And the storage, it's all optimized for its own area. And the only way now that you're gonna get more performance is if you can bang two of those together and optimize over the boundaries, right? These silos have clear boundaries, and if you can push them together, then you're gonna get something a lot faster. So if you take an FPGA and you put networking on it and you put storage on it, and then you put it into the storage card format, you can actually put 32 of these things in a 2U or 3U cabinet.
Not only do you have a lot of storage, but you have a lot of networking into the storage. The network goes directly into the storage, so you've got a huge pipe into it. It's way faster than anything that's out there. They can't do what this card can do, anywhere.
Although, what I hear rumors of is that they're trying to do that. They're trying to think through how to do that. So one of the companies has their data engine or something, and now they're trying to do NVMe over Fabrics. And as soon as they start thinking about that, it's gonna get to the point of, why don't we just put the fabric right next to it?
But these data processors that are coming out, they burn too much power. That's why FPGAs are great for the job. I can fit it in a 75-watt envelope, I can get a lot of work out of it, and I can manage how much power I take and all that kind of stuff. And I think those will only get bigger and more powerful, storage-wise.
If they get to 150 watts, then it's gonna be obvious to everybody. But I've put in a patent, and I'm trying to work to get that funded. It's a little more difficult because it's a board, but my bootstrap plan is to work on this algorithmic state machine. It's really made to be the back end of a high-level HDL, kind of a high-level language compiler.
The state machine is very small and very efficient. And instead of one RISC-V, you could have 20 of these things, all running concurrently. And I just think that right now we've gone a little too fast down the path of HLD to hardware, not hardware description, but HLD, high-level language design. We've gone too far down that path too quickly, and there's a data plane and a control plane.
And right now, the control plane we're using is one-hot state machines. They're not reloadable, they're not reusable. They're very hardwired, they're not really efficient, and they're just a conglomeration of one-hot states. There hasn't really been a breakthrough in low-level state machine work since about 1970, basically. That was the last time, right?
So this is a whole new way to do state machines. It's very fast, you program it in C, and you can debug it with GCC. You can reload it at run time, so the idea is that you have one algorithm, you've got all the hardware data path for that, and you put it down and then you hook it up to one or more of these state machines.
And then you take another algorithm and you try to overlay it on top of the first algorithm, and you do resource sharing, and then you run those extra lines of control into one of these state machines. Then what happens is that, to go from one algorithm to the next, you just reload the state machine, which is extremely fast. We're talking less than a thousand bytes of stuff that can completely change the behavior of your algorithm.
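The reload idea can be illustrated in software: think of each algorithm's control as a small microcode table driving a fixed data path, so switching algorithms is just swapping in a table of a few hundred bytes. This toy interpreter is a hypothetical sketch (the accumulator data path, the op names, and the example tables are invented for illustration, not DRC's or the patent's actual design):

```python
def run_microcode(program, x, max_steps=100):
    """Interpret a microcode table over a fixed one-accumulator data path.
    Each entry is (op, arg, next_state); 'halt' stops the machine."""
    acc, state = x, 0
    for _ in range(max_steps):
        op, arg, nxt = program[state]
        if op == "halt":
            return acc
        elif op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
        state = nxt
    raise RuntimeError("no halt reached")

# Two "algorithms" sharing the same data path; swapping the small table
# swaps the behavior, the way reloading the state machine does in hardware.
double_plus_one = [("mul", 2, 1), ("add", 1, 2), ("halt", 0, 0)]
triple_minus_two = [("mul", 3, 1), ("add", -2, 2), ("halt", 0, 0)]
```

The hardware point is that the adders and multipliers (the expensive part) stay put, and only the tiny control table changes between algorithms.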
And nothing's for free, so both of those algorithms will run a little bit slower. But if you look at Amdahl's law, that's nothing compared to the overall gain you'll get by putting more software into hardware.
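That Amdahl's-law point can be made concrete: if a fraction f of the runtime moves into hardware with speedup s, the overall speedup is 1 / ((1 - f) + f/s), so a small per-algorithm slowdown from the shared resources barely dents the gain. A quick sketch (the example numbers are illustrative, not measurements):

```python
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of the runtime is accelerated by s."""
    return 1.0 / ((1.0 - f) + f / s)

# e.g. 90% of the work accelerated 20x in hardware; even if the overlay's
# resource sharing costs 10% on that portion (effective ~18x), the overall
# win is still several-fold, dominated by the 10% left on the CPU.
overall = amdahl_speedup(0.9, 18)
```

The unaccelerated fraction dominates the limit, which is exactly why he argues for pushing ever more of the software side into the hardware overlay.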
So that's the thing you need to do: overlay these circuits. And my general thesis on that is that in an area, you'll start doing that, and then you'll get to the point where you don't have to add any more muxes or anything. The next algorithm that comes in, you just have to generate microcode bits for it and swap those in, and you'll get a kind of general-purpose, ugly-looking thing that could probably help with AI and all the stuff that's going on now, for it to be efficient.
And like I said, that's the key problem now: changing the behavior of the FPGA. You have to change what it does in an extremely short amount of time to keep up with the GPUs and the CPUs, if you really wanna do it right.