In big data clusters, task dispatchers assign arriving tasks to one of many workers (servers) for load balancing. Workers schedule task executions for rapidly completing queueing tasks. Both dispatchers and workers are important for optimizing task/job-completion-time (TCT/JCT). Current dispatchers probe loads on workers before assigning every task/job, which incurs expensive message overheads and significant delays. Besides, they use simple First-In-First-Out (FIFO) scheduling on workers, which further harms their TCT/JCT performance due to head-of-line blocking. In our TASCO scheduler, workers report their loads to dispatchers so that dispatchers avoid to probe them, which significantly reduces expensive overheads and delays on current dispatchers. Motivated by recent observations that more than 60: tasks in big data clusters are recurring with predictable task service time, we also use delay-optimal smallest-task-first (STF) scheduling to improve current simple FIFO scheduling on workers. We also derive the average TCT of TASCO based on its equivalence to an M/G/1/STF queue and the insight that workers reporting loads to dispatchers follows a Poisson process in the large-system limit. Our theories and simulation results demonstrate that the average TCT/JCT of TASCO outperforms state-of-art schedulers from 5.3: to 55.9:.