Once Matt showed me how easy this is, I instantly got an idea: Lustre, another distributed problem. In this case I didn't care about logs, I cared about time series data, and I had two goals:
- What is our filesystem performance over time, in both bandwidth and open/close ops?
- Find the users who open 999999 files/s in a single code.
All the config files used at the time of writing are available on GitHub.
First, some pictures:
So how did I easily get this data? Enter Logstash and the exec {} input. Lustre stores all its summary stats in files under directories like /proc/fs/lustre/[mdt|obdfilter]/, so a small wrapper script can dump them as JSON for Logstash to consume:
json-stats-wrapper.py
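A minimal sketch of what such a wrapper can look like (not the actual script), assuming the usual Lustre stats layout of one counter per line ("name count samples [units] ..."):

#!/usr/bin/env python
# json-stats-wrapper.py -- minimal sketch, not the repo version.
# Reads a Lustre stats file and prints its counters as one JSON object.
import json
import sys

def stats_to_json(path):
    # keep the path in the event so Logstash can parse it later
    metrics = {"source": path}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                # first field is the counter name, second the raw count
                metrics[fields[0]] = fields[1]
    return json.dumps(metrics)

if __name__ == "__main__":
    print(stats_to_json(sys.argv[1]))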
./json-stats-wrapper.py /proc/fs/lustre/mdt/scratch-MDT0000/md_stats | python -mjson.tool

Logstash, if told that the input data is JSON, will treat this as an event. In general the logstash-lustre.conf and logstash-lustre-mds.conf configs parse each event, including the path to the stats file, grab each counter, and build a metric from it. The wrapper's output looks like this:
{
"close": "241694077",
"crossdir_rename": "300771",
"getattr": "439797690",
"getxattr": "3393359",
"link": "117530",
"mkdir": "1332774",
"mknod": "1209",
"open": "789470206",
"rename": "522526",
"rmdir": "1289414",
"samedir_rename": "221755",
"setattr": "12991707",
"setxattr": "118798",
"snapshot_time": "1414810134.237384",
"source": "/proc/fs/lustre/mdt/scratch-MDT0000/md_stats",
"statfs": "799026",
"sync": "43951",
"unlink": "25767242"
}
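The actual logstash-lustre.conf is on GitHub; a minimal sketch of the shape of such a pipeline follows. The script path, the Graphite host name, and the single hard-coded metric mapping are placeholders, and the real configs build metric names for every counter rather than just one:

input {
  exec {
    # run the wrapper every 10 seconds; the json codec
    # turns its stdout into a structured event
    command => "/usr/local/bin/json-stats-wrapper.py /proc/fs/lustre/mdt/scratch-MDT0000/md_stats"
    interval => 10
    codec => "json"
  }
}

filter {
  # pull the filesystem name and target out of the source field,
  # e.g. /proc/fs/lustre/mdt/scratch-MDT0000/md_stats
  grok {
    match => [ "source", "/proc/fs/lustre/%{WORD:devtype}/%{WORD:fsname}-%{WORD:target}/" ]
  }
}

output {
  graphite {
    host => "graphite.example.com"
    metrics => { "lustre.%{fsname}.MDT.0000.open" => "%{open}" }
  }
}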
You want to be smart about your groupings. Lucky for us, the Lustre devs did things in a very logical way, and it almost falls in our laps. You can use these groups/wildcards with Graphite to quickly make lots of plots for the same metric over all OSTs, MDTs, clients, etc.
lustre.<fsname>.<OST|MDT>.<index>[.<client>].<metric>

E.g. lustre.scratch.MDT.0000.open
E.g. lustre.scratch.OST.*.10-255-1-100.read_bytes
I don't calculate any rates in Logstash; I chose to store the raw counter value from the stats files. Graphite has lots of nice built-in functions that let you calculate rates, like nonNegativeDerivative(), which deals with counters that roll over, such as after a reboot. So keep your data raw.
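For example, at render time you can turn the raw MDT open counter into per-interval deltas, or sum the derivatives over every OST for total write bandwidth per second (the write_bytes name here is my guess by analogy with the read_bytes example above):

nonNegativeDerivative(lustre.scratch.MDT.0000.open)
scaleToSeconds(sumSeries(nonNegativeDerivative(lustre.scratch.OST.*.write_bytes)), 1)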
Be careful with the number of metrics and your Graphite storage schema. We keep our per-OST/MDT data for a year (10s:7d,10m:30d,60m:180d,6h:1y) and per-client stats for 30 days (2m:30d). In our case we have over 1000 clients, 30 OSTs, and 1 MDT, and we store 16 metrics/MDT, 16 metrics/MDT/client, 18 metrics/OST, and 4 metrics/OST/client. In all we store 53,300 metrics just for Lustre. This is about 14GB of data right now, and because of the way Graphite works it will not grow unless we add more metrics or more clients.
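In storage-schemas.conf terms that looks roughly like the following; the section names and patterns are mine, and the more specific per-client rule has to come first because carbon uses the first pattern that matches:

[lustre_per_client]
# client metrics carry an IP-like segment such as 10-255-1-100
pattern = ^lustre\..*\.\d+-\d+-\d+-\d+\.
retentions = 2m:30d

[lustre_per_target]
pattern = ^lustre\.
retentions = 10s:7d,10m:30d,60m:180d,6h:1y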
The more likely problem is that you send metrics too fast for Graphite, and the number of IOs your Graphite server's disk can provide won't be up to the task. In our case we update OST/MDT summary stats every 10 seconds, and per-client OST/MDT stats every 2 minutes. This is being handled by a single 7,200 RPM SATA drive with some tweaking.
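If your disk can't keep up, the usual carbon.conf knobs are the write rate limits, which make carbon batch points in its cache instead of doing one IO per update. The values below are illustrative, not our actual settings:

# carbon.conf [cache] section
MAX_UPDATES_PER_SECOND = 600   # cap whisper file updates per second
MAX_CREATES_PER_MINUTE = 50    # throttle creation of new .wsp files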
Future work:
- Use Logstash to alert us on slow_attr and LBUG events.
- Use the DDN SFA SNMP MIB to get raw SFA counters into Graphite
- Alert if an OST is deactivated, or goes into/out of recovery.