Back Prop Doubt

I'm having trouble implementing gradient descent with backpropagation while coding up a simple neural network from scratch. As an example, consider the neural network architecture below.

Since gradient descent needs the derivative of the cost with respect to every weight and bias in each layer, I tried to work out the math by hand, but my calculations don't seem to agree with the practical implementation. Here is how I've represented the network:

The first two neurons are just the inputs x1 and x2. The neurons in the 2nd and 3rd layers use the ReLU and sigmoid activation functions respectively, since this is a binary classification problem. The superscript indicates which layer an element belongs to. I'm trying to calculate the derivative of the cost function with respect to the weight associated with the bold link, i.e. w^[1]_(1,2).

w^[1]_(1,2) ⇒ The superscript corresponds to layer 1, and the subscript (1,2) corresponds to the 1st neuron of that layer and the 2nd weight associated with that neuron, i.e. (which neuron, which weight) ⇒ (1,2).

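I want the partial derivative of the cost function with respect to w^[1]_(1,2), i.e. one particular weight, just to see how things work. The cost function I'm using is the cross-entropy loss, where x is the network's predicted output (not the inputs x1, x2), y is the label, and the sum runs over the training examples: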

$$ C = -\sum \left ( y\log x + (1-y)\log (1-x) \right ) $$
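For reference, the cost above could be written in NumPy roughly as follows. This is only a sketch: the function name `cross_entropy_cost` and the sample values are made up for illustration, and it assumes the predictions and labels are stored as (1,m) arrays. The point is that the sum over examples makes C a single number, not an array.

```python
import numpy as np

def cross_entropy_cost(a, y):
    """Summed binary cross-entropy.

    a: predicted probabilities, shape (1, m) (the `x` in the formula above)
    y: labels in {0, 1}, shape (1, m)
    """
    # np.sum collapses the (1, m) array of per-example losses to one scalar.
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical predictions and labels for m = 3 examples.
a = np.array([[0.9, 0.2, 0.6]])
y = np.array([[1.0, 0.0, 1.0]])

cost = cross_entropy_cost(a, y)
print(cost, np.ndim(cost))  # the second value is 0, i.e. a scalar
```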

Here are my calculations:

W^[1] is a matrix that contains all the weights of the 1st layer. Its shape is (2,3), since each neuron has 2 weights associated with it and there are 3 neurons in the 1st layer. Since the final equation consists only of element-wise multiplications, we end up with an array of shape (1,m), where m is the number of training examples.

To update W^[1], each element of dC_dW^[1] must be a float, but it turns out that each element of dC_dW^[1] is a (1,m) array. How is the backpropagation update supposed to work when W^[1] is a (2,3) matrix of floats, whereas its derivative is a (2,3) matrix whose elements are (1,m) arrays? How can (1,m) arrays be subtracted from floats?
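One way to probe this shape question is a small NumPy experiment. The sketch below is not the hand calculation itself. It assumes random data, zero biases, the (2,3) layout for W^[1] described above (rows are weights, columns are neurons, so w^[1]_(1,2) sits at `W1[1, 0]`), and an analogous (3,1) layout for the output layer. All variable names are illustrative. Because C is a sum over the m examples, the chain rule for a single weight also carries that sum: the element-wise product is the (m,) vector of per-example contributions, and adding them up gives one float per weight.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                    # number of training examples
X = rng.normal(size=(2, m))              # inputs x1, x2 for every example
Y = rng.integers(0, 2, size=(1, m)).astype(float)

W1 = rng.normal(size=(2, 3))             # 2 weights per neuron, 3 ReLU neurons
b1 = np.zeros((3, 1))
W2 = rng.normal(size=(3, 1))             # 3 weights, 1 sigmoid neuron
b2 = np.zeros((1, 1))

def forward(W1, b1, W2, b2, X):
    Z1 = W1.T @ X + b1                   # (3, m)
    A1 = np.maximum(Z1, 0.0)             # ReLU
    Z2 = W2.T @ A1 + b2                  # (1, m)
    A2 = 1.0 / (1.0 + np.exp(-Z2))       # sigmoid
    return Z1, A1, Z2, A2

def cost(A2, Y):
    # Summed cross-entropy, not divided by m.
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

Z1, A1, Z2, A2 = forward(W1, b1, W2, b2, X)

dZ2 = A2 - Y                             # (1, m): dC/dZ2 for sigmoid + cross-entropy
dA1 = W2 @ dZ2                           # (3, m)
dZ1 = dA1 * (Z1 > 0)                     # (3, m): ReLU derivative is 0 or 1

# Per-example contributions to dC/dw^[1]_(1,2): neuron 1, weight 2.
per_example = dZ1[0, :] * X[1, :]        # shape (m,)
dw_1_2 = per_example.sum()               # the sum in C collapses it to a float

# The same sum for every weight at once is a matrix product.
dW1 = X @ dZ1.T                          # (2, m) @ (m, 3) -> (2, 3)

print(per_example.shape, dW1.shape, np.isclose(dW1[1, 0], dw_1_2))
```

With this layout `dW1` has the same (2,3) shape as `W1`, so an update such as `W1 - learning_rate * dW1` subtracts floats from floats.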

And here are my calculations for the derivation of the chain rule used in calculating dC_dW^[1]:

I forgot to divide the cost function by m, but that would only scale the derivatives by 1/m and wouldn't change the shapes in question.

I've also tried calculating the derivatives of other weights and biases, and I end up at the same conclusion: the derivative matrix consists of (1,m) arrays as elements, whereas it should contain floats in order to update the parameters.

EDIT on 28 July: