Badge-A-Day: Large Data Volumes
I've decided to start a professional development project: The Badge-A-Day project!
One of the great things about Salesforce is the educational materials they make available to their users via trailhead.salesforce.com.
They've gamified Salesforce in a way that makes learning their complex, in-depth, and (let's face it) at times intimidating suite of products actually fun.
So, as I continue my journey toward architect status, I've decided to challenge myself to earn a badge a day, Monday-Friday, for the rest of the year. The first badge I'll cover is Large Data Volumes.
Part 1: Design Your Data Model
Key Concepts:
- Plan Your Data Model: Large Data Volumes (LDVs) can lead to sluggish performance. Be smart and plan an enterprise-level data model even if you're just starting.
- Avoid Data Skew: This is important! Data skew occurs when a single parent record has more than about 10,000 child records. Spread child records across more parents instead of clustering them all under one record (there's a quick detection sketch after this list).
- Record Locking: Reduce data skew to avoid record locking. When a child record is updated, Salesforce locks its parent, so with account data skew, updates can fail when lots of children of the same parent are being updated at once.
- Ownership Skew: Like account data skew, ownership skew occurs when a single user owns more than 10,000 records of the same object type. This can cause performance issues during sharing recalculations. (A similar detection sketch is below.)
- Lookup Skew: Basically, avoid having a huge number of records all pointing at the same record through a lookup field. Salesforce locks the looked-up record to maintain referential integrity whenever records that reference it are inserted or updated, so a heavily looked-up record becomes a bottleneck.
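To put some teeth on that, here's a rough anonymous Apex sketch for spotting account data skew before it bites. Contact and Account are just example objects, and on a genuinely huge org you'd probably pull these numbers from a report or the Bulk API rather than a synchronous query like this.

```apex
// Anonymous Apex sketch: flag Accounts whose Contact count is creeping toward
// the ~10,000-child data skew threshold. Contact/Account are just examples;
// swap in whatever parent-child pair you're worried about.
AggregateResult[] skewed = [
    SELECT AccountId parentId, COUNT(Id) childCount
    FROM Contact
    WHERE AccountId != null
    GROUP BY AccountId
    HAVING COUNT(Id) > 9000
];
for (AggregateResult ar : skewed) {
    System.debug('Account ' + ar.get('parentId') + ' has ' +
                 ar.get('childCount') + ' child contacts');
}
```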
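Same idea for ownership skew, just grouping by owner instead of parent. Again, treat this as a sketch, not gospel.

```apex
// Anonymous Apex sketch: find users who own an outsized share of one object.
AggregateResult[] heavyOwners = [
    SELECT OwnerId, COUNT(Id) total
    FROM Account
    GROUP BY OwnerId
    HAVING COUNT(Id) > 9000
];
for (AggregateResult ar : heavyOwners) {
    System.debug('User ' + ar.get('OwnerId') + ' owns ' +
                 ar.get('total') + ' accounts');
}
```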
Part 2: Conduct Queries and Searches
Key Concepts:
- SOQL vs. SOSL: Use SOQL to query child-to-parent and parent-to-child relationships. SOSL is a full-text search language that tokenizes the terms within a field. If you're searching for a specific term, SOSL is faster than SOQL. However, the governor limit on SOSL is 2,000 records per query; for SOQL it's 50,000 rows. (Both show up in the first sketch after this list.)
- The best way to query large data sets is to do it asynchronously in batches; there's a Batch Apex sketch after this list.
- Use the Bulk API to retrieve query results of up to 15 GB, and note that it can return records that have been deleted because of a merge or delete.
- Use Skinny Tables: Skinny tables are activated by Salesforce support. You determine a subset of the important fields on your object, and Salesforce builds the skinny table from them. Then, when the object is queried, Salesforce references the narrow skinny table where it can rather than scanning the entire base table, which dramatically increases performance.
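To make the SOQL-versus-SOSL distinction concrete, here's a small sketch. The object and field choices (the Banking industry filter, the 'Acme*' search term) are just examples I made up, not anything from the badge itself.

```apex
// SOQL: relationship queries in both directions.
// Child-to-parent uses dot notation; parent-to-child uses a nested subquery.
List<Contact> bankers = [
    SELECT Name, Account.Name
    FROM Contact
    WHERE Account.Industry = 'Banking'
];
List<Account> withContacts = [
    SELECT Name, (SELECT LastName FROM Contacts)
    FROM Account
    LIMIT 10
];

// SOSL: full-text search across objects; in Apex it returns at most
// 2,000 records per query (SOQL can return up to 50,000 rows).
List<List<SObject>> hits = [
    FIND 'Acme*' IN NAME FIELDS
    RETURNING Account(Id, Name), Contact(Id, LastName)
];
List<Account> matchedAccounts = (List<Account>) hits[0];
```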
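And for querying asynchronously in batches, this is roughly what a minimal Batch Apex job looks like. The class name and the Description update are placeholders, just to show the shape.

```apex
// Minimal Batch Apex sketch: work through a big object a few thousand
// records at a time instead of in one giant synchronous query.
public class LargeDataReviewBatch implements Database.Batchable<SObject> {
    public Database.QueryLocator start(Database.BatchableContext bc) {
        // A QueryLocator can iterate over up to 50 million records.
        return Database.getQueryLocator('SELECT Id, Description FROM Account');
    }
    public void execute(Database.BatchableContext bc, List<Account> scope) {
        for (Account a : scope) {
            a.Description = 'Reviewed'; // placeholder work
        }
        update scope;
    }
    public void finish(Database.BatchableContext bc) {
        System.debug('All batches done.');
    }
}
// Kick it off from anonymous Apex or a scheduler:
// Database.executeBatch(new LargeDataReviewBatch(), 2000);
```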
Part 3: Load Your Data
Key Concepts:
- Load Lean: Only load the fields that actually need to be created or updated; don't clog up your upload with extraneous fields. Then load your data as quickly as possible to reduce the scope of synchronization. Before you load, optimize your org (org-wide defaults, object relationships, sharing rules, workflows, validations, etc.) to increase the speed of the load.
- Don't Load Too Lean: You always need record owners and the parent records of master-detail relationships. Sidebar: I'm bad about forgetting record types. Don't forget your record types! (There's a quick sketch of a lean-but-not-too-lean record after this list.)
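In real life a big load goes through Data Loader or the Bulk API rather than Apex, but here's a quick sketch of what "lean but not too lean" looks like. Invoice__c, its master-detail child Invoice_Line__c, and the 'Retail' record type are all made-up names for illustration.

```apex
// Hypothetical objects: Invoice__c with a master-detail child Invoice_Line__c.
// Lean: only the fields the load needs. Not too lean: owner, record type,
// and the master-detail parent still have to be populated.
Id retailRt = Schema.SObjectType.Invoice__c
    .getRecordTypeInfosByDeveloperName().get('Retail').getRecordTypeId();

Invoice__c inv = new Invoice__c(
    Name         = 'INV-0001',
    OwnerId      = UserInfo.getUserId(), // record owner
    RecordTypeId = retailRt              // don't forget record types!
);
insert inv;

Invoice_Line__c line = new Invoice_Line__c(
    Invoice__c = inv.Id,   // master-detail parent is required
    Amount__c  = 100
);
insert line; // a master-detail child inherits its owner from the parent
```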
Part 4: Perform Data Deletes and Extracts
Key Concepts:
- Soft vs. Hard Delete: Soft deletes are user-driven and send records to the Recycle Bin. Hard deletes happen when the Recycle Bin is emptied or through the Bulk API's hard delete, and they permanently remove the record. (There's a small sketch after this list.)
- Chunking Data: The Bulk API splits queries into batches of 100,000 by default. Large chunk sizes use up fewer Bulk API batches, but may not perform as well.
- PK Chunking: When records grow into the hundreds of millions, use PK chunking. PK = primary key (the record ID), so the query is split into chunks based on record ID ranges instead of on the attributes you'd otherwise filter on when querying via the Bulk API.
- Truncation: Truncating a custom object deletes all of its records. This is perfect for sandboxes where you've loaded a ton of test data into a custom object and don't want to hassle with the Bulk API to delete it all.
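Here's what the soft-versus-hard-delete distinction looks like from Apex. The Name filter is just an example; for truly big purge jobs you'd reach for the Bulk API's hard delete operation instead.

```apex
// Soft delete: records go to the Recycle Bin and still count against storage.
List<Account> stale = [SELECT Id FROM Account WHERE Name LIKE 'Test%' LIMIT 200];
delete stale;

// Hard delete: purge them from the Recycle Bin so they're gone for good.
Database.emptyRecycleBin(stale);
```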
Boom! And that's the 4-1-1 on the Large Data Volumes badge. Let me know if you have any questions in the comments below.