A distributed, shared-memory multiprocessor architecture based on a hierarchical bus system is proposed. The architecture consists of processors and memory modules interconnected by a hierarchy of buses and controllers. The system forms a tree with processors at the leaves, and main memory is distributed among the groups of processors. Each bottom-level bus forms a multiprocessor cluster. The cluster architecture exploits the locality property (cluster locality) of parallel programs and the efficiency of cache coherency. Each cluster is connected to a higher-level bus through a controller that prevents unrelated bus operations from propagating between levels. Instead of a cluster cache, the controller keeps only a bit directory for memory blocks. Each bit represents a sharing state: it indicates that a local block is cached remotely or that a remote block is cached locally. The controller uses the contents of the bit directory to decide whether to propagate a bus operation, and it provides cache-to-cache transfers between clusters through the same directory. The set of processor caches therefore acts as a cluster cache. A recently developed high-performance microprocessor with a large processor cache can thus be used as a processing element without the cluster-cache overhead imposed by the multilevel inclusion property. It is shown that this architecture provides multi-cache coherency and efficient synchronization, and that traffic is well balanced across the levels of the bus hierarchy. Simulation results show that a large-scale multiprocessor built on this architecture can achieve a substantial fraction of its peak performance.
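The filtering role of the controller's bit directory can be illustrated with a minimal sketch. It is not the authors' design: the class `ClusterController`, its fields `local_blocks` and `shared`, and the methods `forward_up` and `forward_down` are hypothetical names. The sketch also assumes that an operation on a remote block always leaves the cluster to reach its home memory, and that an operation on a local block arriving from above always enters the cluster. Only the meaning of the sharing bit is taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterController:
    """Hypothetical model of one cluster's bus controller (illustration only)."""

    # Block addresses whose home memory lies in this cluster (assumed layout).
    local_blocks: range
    # Blocks whose sharing bit is set: a local block cached remotely,
    # or a remote block cached locally.
    shared: set = field(default_factory=set)

    def is_local(self, block: int) -> bool:
        return block in self.local_blocks

    def forward_up(self, block: int) -> bool:
        """Should an operation seen on the cluster bus reach the higher-level bus?"""
        if self.is_local(block):
            # A local block concerns other clusters only if one of them caches it.
            return block in self.shared
        # Assumption: a remote block's home is elsewhere, so the operation leaves.
        return True

    def forward_down(self, block: int) -> bool:
        """Should an operation seen on the higher-level bus enter this cluster?"""
        if self.is_local(block):
            # Assumption: the home memory is here, so the operation enters.
            return True
        # A remote block matters here only if some local cache holds a copy.
        return block in self.shared
```

Under these assumptions, an operation on an unshared local block stays inside its cluster, and an operation on a remote block that no local cache holds never enters it, which is how the directory keeps unrelated traffic off the other bus levels.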