https://www.anandtech.com/show/6971/exp ... processors
"We start with the Cortex A9.
Cortex A9 achieves throughput of 1 operation/cycle for most scalar instructions, except for fp64 MUL and fp64 MAC, which can only be issued once every two cycles. The mixed test reveals that though fp64 muls can only be issued every two cycles, Cortex A9 can issue a fp64 add in the otherwise empty pipeline slot. Thus, in the mixed test it was able to achieve throughput of 1 instruction/cycle. NEON implementation in Cortex A9 has a 64-bit datapath and all NEON instructions take 2 cycles. Qualcomm's Scorpion implementation of scalar implementations is similar to Cortex A9 except that it seems unable to issue fp64 adds immediately after fp64 muls in the mixed test. Scorpion uses a full 128-bit datapath for NEON and has twice the throughput of Cortex A9."
"In the big picture, readers may want to know how the the floating point capabilities of these cores compares to x86 cores. I consider Intel's Ivy Bridge and Haswell as datapoints for big x86 cores, and AMD Jaguar as a datapoint for a small x86 core. For double-precision (fp64), current ARM cores appear to be limited to 2 flops/cycle for FMAC-heavy workloads and 1 flops/cycle for non-FMAC workloads.
Ivy Bridge can have a throughput of up to 8 flops/cycle and Haswell can do 16 flops/cycle with AVX2 instructions. Jaguar can execute up to 3 flops/cycle.
For fp32, Ivy Bridge can execute up to 16 fp32 flops/cycle, Haswell can do up to 32 fp32 flops/cycle and AMD's Jaguar can perform 8 fp32 flops/cycle. Thus, current ARM cores are noticeably behind in this case. Apart from the usual reasons (power and area constraints, very client focused designs), current ARM cores also particularly lag behind in this case because currently NEON does not have vector instructions for fp64. ARMv8 ISA adds fp64 vector instructions and high performance implementations of the ISA such as Cortex A57 should begin to reduce the gap."
According to what they linked, the Cortex A9 is not great at floating point, even with an FPU. So the issue appears to be that the FPU is a bit underpowered even if it's present, not that it isn't present at all. For instance, my old Ivy Bridge desktop from 2012 can do 32 single-precision and 8 double-precision floats in a single cycle, while an ARM processor would struggle to do 1 floating point instruction per cycle. And that's with a Cortex A9, never mind the A7.
"The Athenians, however, represent the unity of these opposites; in them, mind or spirit has emerged from the Theban subjectivity without losing itself in the Spartan objectivity of ethical life. With the Athenians, the rights of the State and of the individual found as perfect a union as was possible at all at the level of the Greek spirit." -- Hegel's philosophy of Mind