If you were comparing wide-angle lenses, you'd be totally right. But in this case, you're really underestimating the additional complexity and elements needed to get that extra 100mm. That is where most of the size and weight come from.
For comparison, the DA55-300 is fairly simple, with 12 elements in 9 groups. The 120-400? That's got 21 elements in 15 groups. Telephoto lenses are easier to make in one respect: the projected image has to be "shortened," not lengthened as in the case of wide-angle lenses. That's a bit easier to do, but if the focal length increase is significant, it becomes a much more complex process optically. Think of it this way: the focal length is roughly the distance from the optical center of the lens (strictly, its rear principal plane) to the focal plane when the lens is focused at infinity. An extra 100mm means the image would form about 4" further back, or well beyond the back of the camera, unless the design pulls it forward. That requires more elements to correct. Further, the jump from 300mm to 400mm is significant: at the same f-number the aperture has to be a third wider, which requires a much larger lens even for the same sensor size.
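To put a rough number on the "much larger lens" point, here is a minimal sketch using the standard relation f-number = focal length / entrance pupil diameter. The function name is made up for illustration, and the focal length and aperture pairs are simply the Nikon primes listed just below. Note that the entrance pupil is not the same thing as the front element, so treat the results as a lower bound on how wide the glass has to be.

```python
def entrance_pupil_mm(focal_length_mm: float, f_number: float) -> float:
    """Entrance pupil diameter in mm, from N = f / D (hypothetical helper)."""
    return focal_length_mm / f_number


# Focal length / maximum aperture pairs from the Nikon primes in this post.
for focal, n in [(300, 2.8), (400, 2.8), (300, 4.0), (500, 4.0)]:
    print(f"{focal}mm f/{n}: {entrance_pupil_mm(focal, n):.1f} mm entrance pupil")
```

At a fixed f-number the pupil diameter scales directly with focal length, so going from 300mm to 400mm means a pupil one third wider, and the glass area grows with the square of that.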
Nikon's 300mm f/2.8 lens is 10.5 in long and weighs 102.3 oz.
Nikon's 400mm f/2.8 lens is 14.09 in long and weighs 134 oz (almost 2 pounds more!), and its front element is 2 in larger in diameter.
Nikon's 300mm f/4 lens is 8.8 in long and weighs about 50 oz.
Nikon's 500mm f/4 lens is 15.2 in long and weighs over 109 oz.
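If you want to check the arithmetic on those four lenses, a quick sketch like this does it. The figures are the ones quoted above (the "about 50" and "over 109" weights are taken as 50 and 109, which is an assumption), and `SPECS` and `compare` are just illustrative names.

```python
# (length in inches, weight in ounces) as quoted in this post.
# The two f/4 weights are rounded ("about 50", "over 109"), so treat them as approximate.
SPECS = {
    "300mm f/2.8": (10.5, 102.3),
    "400mm f/2.8": (14.09, 134.0),
    "300mm f/4": (8.8, 50.0),
    "500mm f/4": (15.2, 109.0),
}


def compare(shorter: str, longer: str) -> dict:
    """Size and weight penalty of the longer lens relative to the shorter one."""
    len_s, wt_s = SPECS[shorter]
    len_l, wt_l = SPECS[longer]
    return {
        "extra_length_in": round(len_l - len_s, 2),
        "extra_weight_lb": round((wt_l - wt_s) / 16, 2),  # 16 oz per pound
        "weight_ratio": round(wt_l / wt_s, 2),
    }


print(compare("300mm f/2.8", "400mm f/2.8"))
print(compare("300mm f/4", "500mm f/4"))
```

With these numbers the 400mm f/2.8 comes out roughly 2 lb heavier than the 300mm f/2.8, and the 500mm f/4 is a bit more than twice the weight of the 300mm f/4.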
So it's not just APS-C vs. FF. In this case, it's really about the focal length difference. It makes a huge difference as you start to move into the super telephoto lenses.
For a more accurate comparison of focal lengths vs. formats, consider the Pentax FA 100-300. This lens is fairly small, and while it's larger than the DA55-300, it's only about 600g vs. the DA's 450g. There's a difference, but it's not a world of difference. The FA is not a terribly good lens, of course, but the DA isn't considered a pinnacle either. Still, this does show you that it's far more about the extra 100mm than anything else.
Now, you can argue that 300mm on APS-C and 400mm on FF are equivalent because of "reach," but that's a bit misleading: using the whole frame in each case, you're getting 24MP on APS-C and 36MP on FF, a huge, huge difference in resolution. It's just hard to compare the two because the pixel densities are so different.
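For anyone who wants to see the reach and pixel density numbers side by side, here is a rough sketch. It assumes a nominal 1.5x APS-C crop factor (real sensors vary slightly) and uses the 24MP and 36MP figures from above; the function names are made up for illustration.

```python
CROP_FACTOR = 1.5  # assumed nominal APS-C crop factor; actual sensors differ slightly


def equivalent_focal_length(focal_mm: float, crop: float = CROP_FACTOR) -> float:
    """Full-frame focal length giving the same field of view as focal_mm on APS-C."""
    return focal_mm * crop


def cropped_megapixels(ff_megapixels: float, crop: float = CROP_FACTOR) -> float:
    """Megapixels left when a full-frame image is cropped to the APS-C area."""
    return ff_megapixels / crop**2


def density_ratio(apsc_mp: float, ff_mp: float, crop: float = CROP_FACTOR) -> float:
    """How many times more pixels per unit area the APS-C sensor has."""
    return apsc_mp * crop**2 / ff_mp


print(equivalent_focal_length(300))  # field-of-view equivalent on full frame, in mm
print(cropped_megapixels(36))        # 36MP full frame cropped to APS-C size
print(density_ratio(24, 36))         # 24MP APS-C vs. 36MP full frame
```

Under that assumption, 300mm on APS-C frames like 450mm on FF rather than 400mm, a 36MP FF image cropped to the APS-C area keeps about 16MP, and the 24MP APS-C sensor packs 1.5 times as many pixels into the same area.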