1. Introduction
In large-scale Internet applications, as data volumes keep growing, a single-database, single-table architecture can no longer meet the demands of high concurrency and big-data storage. Sharding (splitting databases and tables) is a common database architecture optimization that improves database throughput, reduces the amount of data in any single table, and speeds up queries. However, sharding also introduces many complex problems, such as primary key generation, paginated queries, distributed transactions, cross-table queries, and high availability.
This article explains the core concepts, key techniques, and implementation methods of sharding, with sample code to help developers apply sharding in real projects.
2. Overview of sharding
2.1 What is sharding?
Sharding is a technique that splits data originally stored in a single database table across multiple databases or multiple tables. There are two main approaches:
- Horizontal sharding: split data by a field (such as `user_id`) into different database instances and tables. For example:
  - User IDs `1~1000` in `db1.user_table_1`, `1001~2000` in `db1.user_table_2`
  - User IDs `2001~3000` in `db2.user_table_1`, `3001~4000` in `db2.user_table_2`
- Vertical partitioning: split by business module, e.g. store the user table and the order table in different databases; suitable for isolating data between different business systems.
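The horizontal routing rule above can be sketched as a small helper function. This is a minimal illustration assuming the example ranges from the list (2000 IDs per database, 1000 per table); the `route` function is hypothetical, not part of any library:

```go
package main

import "fmt"

// route maps a user ID to a database and table name following the
// example ranges above: IDs 1~2000 live in db1, 2001~4000 in db2,
// and each database holds two tables of 1000 IDs each.
func route(userID int64) (db, table string) {
	dbIndex := (userID-1)/2000 + 1        // 2000 IDs per database
	tableIndex := ((userID-1)/1000)%2 + 1 // 1000 IDs per table
	return fmt.Sprintf("db%d", dbIndex), fmt.Sprintf("user_table_%d", tableIndex)
}

func main() {
	for _, id := range []int64{1, 1500, 2500, 3999} {
		db, table := route(id)
		fmt.Printf("user %d -> %s.%s\n", id, db, table)
	}
}
```

Real systems more often route by `hash(user_id) % N` than by range, which spreads hot users evenly but makes later resharding harder.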
2.2 Why use sharding?
- Solve the performance bottleneck of a single database:A single database cannot handle a large number of concurrent requests, and splitting the database can balance the traffic pressure.
- Improve reading and writing efficiency: Reduce the amount of data in a single table and improve query and insertion speed.
- Reduce storage costs: You can use multiple lower-configuration database instances to replace a high-configuration database.
- Improve system high availability: Multiple database instances can be independent of each other to avoid single point of failure.
3. Generate primary keys for sub-databases and sub-tables
After the database and table are divided, the generation of a globally unique primary key becomes a key issue. Common primary key generation methods include:
- UUID: globally unique, but long, with high storage overhead, and unfriendly to index lookups.
- Database auto-increment ID: works on a single instance, but cannot guarantee uniqueness across databases.
- Snowflake: Twitter's distributed ID generation algorithm, which combines a timestamp, machine ID, and sequence number.
Go implementation example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	workerIDBits     = 5
	datacenterIDBits = 5
	sequenceBits     = 12

	maxWorkerID     = -1 ^ (-1 << workerIDBits)
	maxDatacenterID = -1 ^ (-1 << datacenterIDBits)
	maxSequence     = -1 ^ (-1 << sequenceBits)
)

type Snowflake struct {
	mu            sync.Mutex
	lastTimestamp int64
	workerID      int64
	datacenterID  int64
	sequence      int64
}

func NewSnowflake(workerID, datacenterID int64) *Snowflake {
	return &Snowflake{workerID: workerID, datacenterID: datacenterID}
}

// NextID returns a 64-bit ID composed of a millisecond timestamp,
// datacenter ID, worker ID, and an in-millisecond sequence number.
func (s *Snowflake) NextID() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	timestamp := time.Now().UnixNano() / 1e6 // current time in ms
	if timestamp == s.lastTimestamp {
		// Same millisecond: bump the sequence; if it overflows,
		// spin until the next millisecond.
		s.sequence = (s.sequence + 1) & maxSequence
		if s.sequence == 0 {
			for timestamp <= s.lastTimestamp {
				timestamp = time.Now().UnixNano() / 1e6
			}
		}
	} else {
		s.sequence = 0
	}
	s.lastTimestamp = timestamp
	return (timestamp << (workerIDBits + datacenterIDBits + sequenceBits)) |
		(s.datacenterID << (workerIDBits + sequenceBits)) |
		(s.workerID << sequenceBits) |
		s.sequence
}

func main() {
	sf := NewSnowflake(1, 1)
	fmt.Println(sf.NextID())
}
```
- Database segment (number range) method: preallocate ID ranges, e.g. `db1` generates `1~10000` and `db2` generates `10001~20000`; suitable when the number of databases is small.
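The segment method can be sketched as an in-memory allocator that hands out IDs from a preallocated range. The `Segment` type and its refill behavior below are illustrative assumptions; a production allocator would claim the next range with an atomic `UPDATE` on a central allocation table rather than extending locally:

```go
package main

import (
	"fmt"
	"sync"
)

// Segment hands out IDs from a preallocated range.
// In production, refilling would be an atomic UPDATE on a central
// table (e.g. UPDATE id_alloc SET max = max + step WHERE biz = ?).
type Segment struct {
	mu   sync.Mutex
	cur  int64
	max  int64
	step int64
}

func NewSegment(start, step int64) *Segment {
	return &Segment{cur: start - 1, max: start + step - 1, step: step}
}

func (s *Segment) NextID() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cur >= s.max {
		// Range exhausted: claim the next range. Here we extend
		// locally; a real implementation reloads from the DB.
		s.max += s.step
	}
	s.cur++
	return s.cur
}

func main() {
	seg := NewSegment(1, 10000) // db1's range: 1~10000
	fmt.Println(seg.NextID(), seg.NextID()) // 1 2
}
```

One common refinement is "double buffering": prefetch the next range when the current one is, say, 90% consumed, so allocation never blocks on the central database.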
4. Paginated queries after sharding
Paging query becomes complicated after sharding, because the data is scattered in multiple tables and databases. Common solutions are:
- Query all shards + merge sort:

```sql
SELECT *
FROM (
    SELECT id, name, create_time FROM db1.user_table_1
    UNION ALL
    SELECT id, name, create_time FROM db1.user_table_2
) AS temp
ORDER BY create_time DESC
LIMIT 10 OFFSET 0;
```
- Leverage caching: first query only the primary key IDs, then fetch the detailed rows from the target shard, reducing the overhead of cross-database queries.
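When the UNION ALL approach is not available (shards live in separate instances), the merge happens in the application. Below is a minimal in-memory sketch of that merge step, under the assumption that each shard was queried with `LIMIT offset+limit` sorted by `create_time` so the global page is guaranteed to be contained in the union; the `Row` shape and `mergePage` helper are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

type Row struct {
	ID         int64
	CreateTime int64 // unix timestamp
}

// mergePage merges per-shard result sets and applies a global
// offset/limit. Each shard must have been queried with
// LIMIT offset+limit so no qualifying row is missed.
func mergePage(shards [][]Row, offset, limit int) []Row {
	var all []Row
	for _, rows := range shards {
		all = append(all, rows...)
	}
	// Global sort by create_time descending, matching the SQL ORDER BY.
	sort.Slice(all, func(i, j int) bool { return all[i].CreateTime > all[j].CreateTime })
	if offset >= len(all) {
		return nil
	}
	end := offset + limit
	if end > len(all) {
		end = len(all)
	}
	return all[offset:end]
}

func main() {
	shard1 := []Row{{ID: 1, CreateTime: 300}, {ID: 2, CreateTime: 100}}
	shard2 := []Row{{ID: 3, CreateTime: 200}}
	fmt.Println(mergePage([][]Row{shard1, shard2}, 0, 2))
}
```

Note the cost: page N requires fetching `offset+limit` rows from every shard, which is why deep pagination across shards is usually replaced by cursor-based paging on `(create_time, id)`.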
5. Distributed transactions after sharding
The core issue of distributed transactions is transaction consistency between multiple database instances. Common solutions are:
- TCC (Try-Confirm-Cancel) mode
- Local message table + reliable message queue
- Seata distributed transaction framework
Example: coordinating writes across two databases with Go + GORM (a simplified illustration, not a full distributed transaction):
```go
// createOrder deducts the user's balance in db1 and inserts the order
// into db2. If the db2 insert fails, db1's transaction rolls back; but
// if db1's commit fails after db2 has succeeded, the two databases
// diverge — which is exactly why TCC, local message tables, or Seata
// are needed for true consistency.
func createOrder(db1, db2 *gorm.DB) error {
	return db1.Transaction(func(tx *gorm.DB) error {
		if err := tx.Exec("UPDATE users SET balance = balance - 100 WHERE id = ?", 1).Error; err != nil {
			return err
		}
		if err := db2.Exec("INSERT INTO orders (user_id, amount) VALUES (?, ?)", 1, 100).Error; err != nil {
			return err
		}
		return nil
	})
}
```
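The "local message table" option listed above can be sketched as an outbox: the business row and a message row are written in the same local transaction, and a separate relay later forwards unsent messages to the other database or a message queue. The in-memory `Outbox` type below is purely illustrative (real writes would share one SQL transaction):

```go
package main

import "fmt"

// Outbox is a minimal in-memory simulation of the local-message-table
// pattern. In a real system, the order row and the message row are
// inserted in one SQL transaction, and a relay polls unsent rows.
type Outbox struct {
	orders   []string
	messages []string // pending messages, written "atomically" with orders
	sent     []string
}

// CreateOrder writes the order and its message together, simulating
// a single local transaction.
func (o *Outbox) CreateOrder(order string) {
	o.orders = append(o.orders, order)
	o.messages = append(o.messages, "order_created:"+order)
}

// Relay delivers pending messages. Delivery is at-least-once, so
// downstream consumers must be idempotent.
func (o *Outbox) Relay() {
	o.sent = append(o.sent, o.messages...)
	o.messages = o.messages[:0]
}

func main() {
	var o Outbox
	o.CreateOrder("order-1")
	o.Relay()
	fmt.Println(o.sent) // [order_created:order-1]
}
```

This trades immediate consistency for eventual consistency: the second database catches up when the relay runs, and retries are safe as long as consumers deduplicate.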
9. Conclusion
This article has covered the principles of sharding, primary key generation, paginated queries, distributed transactions, queries without a sharding key, capacity estimation, and high-availability solutions, illustrating how to solve the key problems with code examples.