mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2026-03-28 10:18:25 +08:00
Pull tracing updates from Steven Rostedt:
 "User visible changes:

   - Add an entry into MAINTAINERS file for Rust versions of code

     There's now Rust code for tracing and static branches. To
     differentiate that code from the C code, add entries for the
     Rust version (marked with "[RUST]") so that the right
     maintainers get notified on changes.

   - New bitmask-list option added to tracefs

     When this is set, bitmasks in trace events are displayed as lists
     rather than hex numbers: e.g. 0-5,7,9 instead of 000002bf

   - New show_event_filters file in tracefs

     Instead of having to search all events/*/*/filter for any active
     filters enabled in the trace instance, the file show_event_filters
     will list them so that there's only one file that needs to be
     examined to see if any filters are active.

   - New show_event_triggers file in tracefs

     Instead of having to search all events/*/*/trigger for any active
     triggers enabled in the trace instance, the file
     show_event_triggers will list them so that there's only one file
     that needs to be examined to see if any triggers are active.

   - Have traceoff_on_warning disable trace printk buffer too

     Recently, trace_printk() output gained the ability to be recorded
     in a trace instance other than the top level instance. But if
     traceoff_on_warning triggers, it doesn't stop the buffer with
     trace_printk() and that data can easily be lost by being
     overwritten. Have traceoff_on_warning also disable the instance
     that has trace_printk() being written to it.

   - Update the hist_debug file to show what function the field uses

     When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file
     exists for every event. This displays the internal data of any
     histogram enabled for that event. But it is lacking the function
     that is called to process one of its fields. This is very useful
     information that was missing when debugging histograms.

   - Up the histogram stack size from 16 to 31

     Stack traces can be used as keys for event histograms. Currently
     the size of the stack that is stored is limited to just 16 entries.
     But the storage space in the histogram is 256 bytes, meaning that
     it can store up to 31 entries (plus one for the count of entries).
     Instead of letting that space go to waste, up the limit from 16 to
     31. This makes the keys much more useful.

   - Fix permissions of per CPU file buffer_size_kb

     The per CPU file of buffer_size_kb was incorrectly set to read only
     in a previous cleanup. It should be writable.

   - Reset "last_boot_info" if the persistent buffer is cleared

     The last_boot_info shows address information of a persistent ring
     buffer if it contains data from a previous boot. It is cleared when
     recording starts again, but it is not cleared when the buffer is
     reset. The data is useless after a reset so clear it on reset too.

  Internal changes:

   - A change was made to allow tracepoint callbacks to have preemption
     enabled, and instead be protected by SRCU. This required some
     updates to the callbacks for perf and BPF.

     perf needed to disable preemption directly in its callback because
     it expects preemption disabled in the later code.

     BPF needed to disable migration, as its code expects to run
     completely on the same CPU.

   - Have irq_work wake up other CPU if current CPU is "isolated"

     When there's a waiter waiting on ring buffer data and a new event
     happens, an irq work is triggered to wake up that waiter. This is
     noisy on isolated CPUs (running NO_HZ_FULL). Trigger an IPI to a
     housekeeping CPU instead.

   - Use proper free of trigger_data instead of open coding it

   - Remove redundant call of event_trigger_reset_filter()

     It was called immediately in a function that was called right after
     it.

   - Workqueue cleanups

   - Report an error if tracing_update_buffers() fails

   - Make the enum update workqueue generic for other parts of tracing

     On boot up, a work queue is created to convert enum names into
     their numbers in the trace event format files. This work queue can
     also be used for other aspects of tracing that take some time and
     shouldn't be called by the init call code.

     The blk_trace initialization takes a bit of time. Have the
     initialization code moved to the new tracing generic work queue
     function.

   - Skip kprobe boot event creation call if there's no kprobes defined
     on cmdline

     The kprobe initialization to set up kprobes if they are defined on
     the cmdline requires taking the event_mutex lock. This can be held
     by other tracing code doing initialization for a long time. Since
     kprobes added to the kernel command line need to be set up
     immediately, as they may be tracing early initialization code, they
     cannot be postponed in a work queue and must be set up in the
     initcall code.

     If there's no kprobe on the kernel cmdline, there's no reason to
     take the mutex and slow down the boot up code waiting to get the
     lock only to find out there's nothing to do. Simply exit out early
     if there's no kprobes on the kernel cmdline.

     If there are kprobes on the cmdline, then someone cares more about
     tracing than about the speed of boot up.

   - Clean up the trigger code a bit

   - Move code out of trace.c and into its own files

     trace.c is now over 11,000 lines of code and has become more
     difficult to maintain. Start splitting it up so that related code
     is in its own files.

     Move all the trace_printk() related code into trace_printk.c.

     Move the __always_inline stack functions into trace.h.

     Move the pid filtering code into a new trace_pid.c file.

   - Better define the max latency and snapshot code

     The latency tracers have a "max latency" buffer that is a copy of
     the main buffer and gets swapped with it when a new high latency is
     detected. This preserves the trace leading up to the highest
     latency. The max_latency buffer is never written to directly; it is
     only used to save the last max latency trace.

     A while ago a snapshot feature was added to tracefs to allow user
     space to perform the same logic. It could also enable events to
     trigger a "snapshot" if one of their fields hit a new high. This
     was built on top of the latency max_latency buffer logic.

     Because snapshots came later, they were dependent on the latency
     tracers to be enabled. In reality, the latency tracers depend on
     the snapshot code and not the other way around. It was just that
     they came first.

     Restructure the code and the kconfigs to have the latency tracers
     depend on snapshot code instead. This actually simplifies the logic
     a bit and allows more to be disabled when the latency tracers are
     not defined and the snapshot code is.

   - Fix a "false sharing" in the hwlat tracer code

     The loop to search for latency in hardware was using a variable
     that could be changed by user space for each sample. If the user
     changes this variable, it could cause bus contention, and reading
     that variable can show up as a large latency in the trace causing a
     false positive. Read this variable at the start of the sample with
     a READ_ONCE() into a local variable and keep the code from sharing
     cache lines with readers.

   - Fix function graph tracer static branch optimization code

     When only one tracer is defined for function graph tracing, it uses
     a static branch to call that tracer directly. When another tracer
     is added, it goes into loop logic to call all the registered
     callbacks.

     The code was incorrect when going back to one tracer and never
     re-enabled the static branch to restore the optimization.

   - And other small fixes and cleanups"
* tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (46 commits)
function_graph: Restore direct mode when callbacks drop to one
tracing: Fix indentation of return statement in print_trace_fmt()
tracing: Reset last_boot_info if ring buffer is reset
tracing: Fix to set write permission to per-cpu buffer_size_kb
tracing: Fix false sharing in hwlat get_sample()
tracing: Move d_max_latency out of CONFIG_FSNOTIFY protection
tracing: Better separate SNAPSHOT and MAX_TRACE options
tracing: Add tracer_uses_snapshot() helper to remove #ifdefs
tracing: Rename trace_array field max_buffer to snapshot_buffer
tracing: Move pid filtering into trace_pid.c
tracing: Move trace_printk functions out of trace.c and into trace_printk.c
tracing: Use system_state in trace_printk_init_buffers()
tracing: Have trace_printk functions use flags instead of using global_trace
tracing: Make tracing_update_buffers() take NULL for global_trace
tracing: Make printk_trace global for tracing system
tracing: Move ftrace_trace_stack() out of trace.c and into trace.h
tracing: Move __trace_buffer_{un}lock_*() functions to trace.h
tracing: Make tracing_selftest_running global to the tracing subsystem
tracing: Make tracing_disabled global for tracing system
tracing: Clean up use of trace_create_maxlat_file()
...
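
The bitmask-list option described in the message above prints set bits as ranges rather than as a hex word. A minimal user-space sketch of that conversion follows. The helper name `mask_to_list()` and the fixed 32-bit width are assumptions made for illustration; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the set bits of `mask` as a range list such as "0-5,7,9".
 * Hypothetical helper for illustration only.
 * Returns the number of characters written for complete entries
 * (excluding the terminating NUL).
 */
size_t mask_to_list(uint32_t mask, char *buf, size_t len)
{
	size_t off = 0;
	int bit = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	while (bit < 32) {
		int start, end, n;

		if (!(mask & (UINT32_C(1) << bit))) {
			bit++;
			continue;
		}

		/* Found the start of a run of set bits; extend it. */
		start = bit;
		while (bit < 32 && (mask & (UINT32_C(1) << bit)))
			bit++;
		end = bit - 1;

		if (start == end)
			n = snprintf(buf + off, len - off, "%s%d",
				     off ? "," : "", start);
		else
			n = snprintf(buf + off, len - off, "%s%d-%d",
				     off ? "," : "", start, end);
		if (n < 0 || (size_t)n >= len - off)
			break;	/* out of space: stop at the last entry that fit */
		off += (size_t)n;
	}
	return off;
}
```

Calling `mask_to_list(0x2bf, buf, sizeof(buf))` with a 64-byte buffer leaves the string `0-5,7,9` in `buf`.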
2222 lines
54 KiB
C
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2006 Jens Axboe <axboe@kernel.dk>
 *
 */

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/kernel.h>
#include <linux/blkdev.h>
#include <linux/blktrace_api.h>
#include <linux/percpu.h>
#include <linux/init.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/debugfs.h>
#include <linux/export.h>
#include <linux/time.h>
#include <linux/uaccess.h>
#include <linux/list.h>
#include <linux/blk-cgroup.h>

#include "../../block/blk.h"

#include <trace/events/block.h>

#include "trace_output.h"

#ifdef CONFIG_BLK_DEV_IO_TRACE
|
|
|
|
static unsigned int blktrace_seq __read_mostly = 1;
|
|
|
|
static struct trace_array *blk_tr;
|
|
static bool blk_tracer_enabled __read_mostly;
|
|
|
|
static LIST_HEAD(running_trace_list);
|
|
static __cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(running_trace_lock);
|
|
|
|
/* Select an alternative, more minimalistic output than the original one */
|
|
#define TRACE_BLK_OPT_CLASSIC 0x1
|
|
#define TRACE_BLK_OPT_CGROUP 0x2
|
|
#define TRACE_BLK_OPT_CGNAME 0x4
|
|
|
|
static struct tracer_opt blk_tracer_opts[] = {
|
|
/* Default disable the minimalistic output */
|
|
{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
|
|
#ifdef CONFIG_BLK_CGROUP
|
|
{ TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
|
|
{ TRACER_OPT(blk_cgname, TRACE_BLK_OPT_CGNAME) },
|
|
#endif
|
|
{ }
|
|
};
|
|
|
|
static struct tracer_flags blk_tracer_flags = {
|
|
.val = 0,
|
|
.opts = blk_tracer_opts,
|
|
};
|
|
|
|
/* Global reference count of probes */
|
|
static DEFINE_MUTEX(blk_probe_mutex);
|
|
static int blk_probes_ref;
|
|
|
|
static void blk_register_tracepoints(void);
|
|
static void blk_unregister_tracepoints(void);
|
|
|
|
static void record_blktrace_event(struct blk_io_trace *t, pid_t pid, int cpu,
				  sector_t sector, int bytes, u64 what,
				  dev_t dev, int error, u64 cgid,
				  ssize_t cgid_len, void *pdu_data, int pdu_len)
{
	/*
	 * These two are not needed in ftrace as they are in the
	 * generic trace_entry, filled by tracing_generic_entry_update,
	 * but for the trace_event->bin() synthesizer benefit we do it
	 * here too.
	 */
	t->cpu = cpu;
	t->pid = pid;

	t->sector = sector;
	t->bytes = bytes;
	t->action = lower_32_bits(what);
	t->device = dev;
	t->error = error;
	t->pdu_len = pdu_len + cgid_len;

	if (cgid_len)
		memcpy((void *)t + sizeof(*t), &cgid, cgid_len);
	if (pdu_len)
		memcpy((void *)t + sizeof(*t) + cgid_len, pdu_data, pdu_len);
}
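
The function above writes a fixed-size header and then appends the optional cgroup id and the payload directly behind it, recording their combined size in `pdu_len`. The user-space sketch below shows the same record layout. `struct demo_hdr` and `demo_pack()` are hypothetical, much simplified stand-ins and not the real `struct blk_io_trace`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the fixed-size trace header. */
struct demo_hdr {
	uint32_t pid;
	uint16_t pdu_len;	/* total bytes that follow the header */
};

/*
 * Write the header into `dst`, then the optional cgroup id, then the
 * payload. `dst` must have room for
 * sizeof(struct demo_hdr) + cgid_len + pdu_len bytes.
 * Returns the total record length in bytes.
 */
size_t demo_pack(void *dst, uint32_t pid, uint64_t cgid,
		 const void *pdu, size_t pdu_len)
{
	struct demo_hdr hdr;
	size_t cgid_len = cgid ? sizeof(cgid) : 0;
	unsigned char *p = dst;

	assert(dst != NULL);

	memset(&hdr, 0, sizeof(hdr));
	hdr.pid = pid;
	hdr.pdu_len = (uint16_t)(pdu_len + cgid_len);
	memcpy(p, &hdr, sizeof(hdr));

	/* A cgroup id of 0 means "none": nothing is stored for it. */
	if (cgid_len)
		memcpy(p + sizeof(hdr), &cgid, cgid_len);
	if (pdu_len)
		memcpy(p + sizeof(hdr) + cgid_len, pdu, pdu_len);

	return sizeof(hdr) + cgid_len + pdu_len;
}
```

A reader of such a record needs a flag outside the payload to know whether the first eight payload bytes are a cgroup id; in the listing that role appears to be played by the `__BLK_TA_CGROUP` and `__BLK_TN_CGROUP` action bits.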
|
|
|
|
static void record_blktrace_event2(struct blk_io_trace2 *t2, pid_t pid, int cpu,
|
|
sector_t sector, int bytes, u64 what,
|
|
dev_t dev, int error, u64 cgid,
|
|
ssize_t cgid_len, void *pdu_data,
|
|
int pdu_len)
|
|
{
|
|
t2->pid = pid;
|
|
t2->cpu = cpu;
|
|
|
|
t2->sector = sector;
|
|
t2->bytes = bytes;
|
|
t2->action = what;
|
|
t2->device = dev;
|
|
t2->error = error;
|
|
t2->pdu_len = pdu_len + cgid_len;
|
|
|
|
if (cgid_len)
|
|
memcpy((void *)t2 + sizeof(*t2), &cgid, cgid_len);
|
|
if (pdu_len)
|
|
memcpy((void *)t2 + sizeof(*t2) + cgid_len, pdu_data, pdu_len);
|
|
}
|
|
|
|
static void relay_blktrace_event1(struct blk_trace *bt, unsigned long sequence,
|
|
pid_t pid, int cpu, sector_t sector, int bytes,
|
|
u64 what, int error, u64 cgid,
|
|
ssize_t cgid_len, void *pdu_data, int pdu_len)
|
|
{
|
|
struct blk_io_trace *t;
|
|
size_t trace_len = sizeof(*t) + pdu_len + cgid_len;
|
|
|
|
t = relay_reserve(bt->rchan, trace_len);
|
|
if (!t)
|
|
return;
|
|
|
|
t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
|
|
t->sequence = sequence;
|
|
t->time = ktime_to_ns(ktime_get());
|
|
|
|
record_blktrace_event(t, pid, cpu, sector, bytes, what, bt->dev, error,
|
|
cgid, cgid_len, pdu_data, pdu_len);
|
|
}
|
|
|
|
static void relay_blktrace_event2(struct blk_trace *bt, unsigned long sequence,
|
|
pid_t pid, int cpu, sector_t sector,
|
|
int bytes, u64 what, int error, u64 cgid,
|
|
ssize_t cgid_len, void *pdu_data, int pdu_len)
|
|
{
|
|
struct blk_io_trace2 *t;
|
|
size_t trace_len = sizeof(struct blk_io_trace2) + pdu_len + cgid_len;
|
|
|
|
t = relay_reserve(bt->rchan, trace_len);
|
|
if (!t)
|
|
return;
|
|
|
|
t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE2_VERSION;
|
|
t->sequence = sequence;
|
|
t->time = ktime_to_ns(ktime_get());
|
|
|
|
record_blktrace_event2(t, pid, cpu, sector, bytes, what, bt->dev, error,
|
|
cgid, cgid_len, pdu_data, pdu_len);
|
|
}
|
|
|
|
static void relay_blktrace_event(struct blk_trace *bt, unsigned long sequence,
|
|
pid_t pid, int cpu, sector_t sector, int bytes,
|
|
u64 what, int error, u64 cgid,
|
|
ssize_t cgid_len, void *pdu_data, int pdu_len)
|
|
{
|
|
if (bt->version == 2)
|
|
return relay_blktrace_event2(bt, sequence, pid, cpu, sector,
|
|
bytes, what, error, cgid, cgid_len,
|
|
pdu_data, pdu_len);
|
|
return relay_blktrace_event1(bt, sequence, pid, cpu, sector, bytes,
|
|
what, error, cgid, cgid_len, pdu_data,
|
|
pdu_len);
|
|
}
|
|
|
|
/*
|
|
* Send out a notify message.
|
|
*/
|
|
static void trace_note(struct blk_trace *bt, pid_t pid, u64 action,
|
|
const void *data, size_t len, u64 cgid)
|
|
{
|
|
struct ring_buffer_event *event = NULL;
|
|
struct trace_buffer *buffer = NULL;
|
|
unsigned int trace_ctx = 0;
|
|
int cpu = smp_processor_id();
|
|
bool blk_tracer = blk_tracer_enabled;
|
|
ssize_t cgid_len = cgid ? sizeof(cgid) : 0;
|
|
|
|
action = lower_32_bits(action | (cgid ? __BLK_TN_CGROUP : 0));
|
|
if (blk_tracer) {
|
|
struct blk_io_trace2 *t;
|
|
size_t trace_len = sizeof(*t) + cgid_len + len;
|
|
|
|
buffer = blk_tr->array_buffer.buffer;
|
|
trace_ctx = tracing_gen_ctx_flags(0);
|
|
event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
|
|
trace_len, trace_ctx);
|
|
if (!event)
|
|
return;
|
|
t = ring_buffer_event_data(event);
|
|
record_blktrace_event2(t, pid, cpu, 0, 0,
|
|
action, bt->dev, 0, cgid, cgid_len,
|
|
(void *)data, len);
|
|
trace_buffer_unlock_commit(blk_tr, buffer, event, trace_ctx);
|
|
return;
|
|
}
|
|
|
|
if (!bt->rchan)
|
|
return;
|
|
|
|
relay_blktrace_event(bt, 0, pid, cpu, 0, 0, action, 0, cgid,
|
|
cgid_len, (void *)data, len);
|
|
}
|
|
|
|
/*
|
|
* Send out a notify for this process, if we haven't done so since a trace
|
|
* started
|
|
*/
|
|
static void trace_note_tsk(struct task_struct *tsk)
|
|
{
|
|
unsigned long flags;
|
|
struct blk_trace *bt;
|
|
|
|
tsk->btrace_seq = blktrace_seq;
|
|
raw_spin_lock_irqsave(&running_trace_lock, flags);
|
|
list_for_each_entry(bt, &running_trace_list, running_list) {
|
|
trace_note(bt, tsk->pid, BLK_TN_PROCESS, tsk->comm,
|
|
sizeof(tsk->comm), 0);
|
|
}
|
|
raw_spin_unlock_irqrestore(&running_trace_lock, flags);
|
|
}
|
|
|
|
static void trace_note_time(struct blk_trace *bt)
{
	struct timespec64 now;
	unsigned long flags;
	u32 words[2];

	/* need to check user space to see if this breaks in y2038 or y2106 */
	ktime_get_real_ts64(&now);
	words[0] = (u32)now.tv_sec;
	words[1] = now.tv_nsec;

	local_irq_save(flags);
	trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words), 0);
	local_irq_restore(flags);
}
|
|
|
|
void __blk_trace_note_message(struct blk_trace *bt,
|
|
struct cgroup_subsys_state *css, const char *fmt, ...)
|
|
{
|
|
int n;
|
|
va_list args;
|
|
unsigned long flags;
|
|
char *buf;
|
|
u64 cgid = 0;
|
|
|
|
if (unlikely(bt->trace_state != Blktrace_running &&
|
|
!blk_tracer_enabled))
|
|
return;
|
|
|
|
/*
|
|
* If the BLK_TC_NOTIFY action mask isn't set, don't send any note
|
|
* message to the trace.
|
|
*/
|
|
if (!(bt->act_mask & BLK_TC_NOTIFY))
|
|
return;
|
|
|
|
local_irq_save(flags);
|
|
buf = this_cpu_ptr(bt->msg_data);
|
|
va_start(args, fmt);
|
|
n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
|
|
va_end(args);
|
|
|
|
#ifdef CONFIG_BLK_CGROUP
|
|
if (css && (blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
|
|
cgid = cgroup_id(css->cgroup);
|
|
else
|
|
cgid = 1;
|
|
#endif
|
|
trace_note(bt, current->pid, BLK_TN_MESSAGE, buf, n, cgid);
|
|
local_irq_restore(flags);
|
|
}
|
|
EXPORT_SYMBOL_GPL(__blk_trace_note_message);
|
|
|
|
static int act_log_check(struct blk_trace *bt, u64 what, sector_t sector,
			 pid_t pid)
{
	if (((bt->act_mask << BLK_TC_SHIFT) & what) == 0)
		return 1;
	if (sector && (sector < bt->start_lba || sector > bt->end_lba))
		return 1;
	if (bt->pid && pid != bt->pid)
		return 1;

	return 0;
}
|
|
|
|
/*
 * Data direction bit lookup
 */
static const u32 ddir_act[2] = { BLK_TC_ACT(BLK_TC_READ),
				 BLK_TC_ACT(BLK_TC_WRITE) };

#define BLK_TC_RAHEAD		BLK_TC_AHEAD
#define BLK_TC_PREFLUSH		BLK_TC_FLUSH

/* The ilog2() calls fall out because they're constant */
#define MASK_TC_BIT(rw, __name) ((__force u32)(rw & REQ_ ## __name) << \
	  (ilog2(BLK_TC_ ## __name) + BLK_TC_SHIFT - __REQ_ ## __name))
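
MASK_TC_BIT() isolates one request flag and shifts it left so that it lands on the matching trace category bit, avoiding a branch per flag. The sketch below shows the same bit-relocation idea with explicit bit positions. `demo_move_bit()` and the positions used with it are hypothetical; the real macro derives them from the `REQ_*` and `BLK_TC_*` constants at compile time.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Isolate bit `src_bit` of `flags` and move it to position `dst_bit`.
 * Like the macro it illustrates, this only shifts left, so it assumes
 * dst_bit >= src_bit. Hypothetical helper for illustration only.
 */
uint64_t demo_move_bit(uint64_t flags, unsigned int src_bit,
		       unsigned int dst_bit)
{
	uint64_t isolated;

	assert(src_bit <= dst_bit && dst_bit < 64);

	isolated = flags & (UINT64_C(1) << src_bit);
	return isolated << (dst_bit - src_bit);
}
```

If the source bit is clear the result is 0, so OR-ing the result into an accumulator is safe for every flag, which is how the macro is used with `what |= ...` in the listing.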
|
|
|
|
/*
|
|
* The worker for the various blk_add_trace*() types. Fills out a
|
|
* blk_io_trace structure and places it in a per-cpu subbuffer.
|
|
*/
|
|
static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
|
|
const blk_opf_t opf, u64 what, int error,
|
|
int pdu_len, void *pdu_data, u64 cgid)
|
|
{
|
|
struct task_struct *tsk = current;
|
|
struct ring_buffer_event *event = NULL;
|
|
struct trace_buffer *buffer = NULL;
|
|
unsigned long flags = 0;
|
|
unsigned long *sequence;
|
|
unsigned int trace_ctx = 0;
|
|
pid_t pid;
|
|
int cpu;
|
|
bool blk_tracer = blk_tracer_enabled;
|
|
ssize_t cgid_len = cgid ? sizeof(cgid) : 0;
|
|
const enum req_op op = opf & REQ_OP_MASK;
|
|
size_t trace_len;
|
|
|
|
if (unlikely(bt->trace_state != Blktrace_running && !blk_tracer))
|
|
return;
|
|
|
|
what |= ddir_act[op_is_write(op) ? WRITE : READ];
|
|
what |= MASK_TC_BIT(opf, SYNC);
|
|
what |= MASK_TC_BIT(opf, RAHEAD);
|
|
what |= MASK_TC_BIT(opf, META);
|
|
what |= MASK_TC_BIT(opf, PREFLUSH);
|
|
what |= MASK_TC_BIT(opf, FUA);
|
|
|
|
switch (op) {
|
|
case REQ_OP_DISCARD:
|
|
case REQ_OP_SECURE_ERASE:
|
|
what |= BLK_TC_ACT(BLK_TC_DISCARD);
|
|
break;
|
|
case REQ_OP_FLUSH:
|
|
what |= BLK_TC_ACT(BLK_TC_FLUSH);
|
|
break;
|
|
case REQ_OP_ZONE_APPEND:
|
|
what |= BLK_TC_ACT(BLK_TC_ZONE_APPEND);
|
|
break;
|
|
case REQ_OP_ZONE_RESET:
|
|
what |= BLK_TC_ACT(BLK_TC_ZONE_RESET);
|
|
break;
|
|
case REQ_OP_ZONE_RESET_ALL:
|
|
what |= BLK_TC_ACT(BLK_TC_ZONE_RESET_ALL);
|
|
break;
|
|
case REQ_OP_ZONE_FINISH:
|
|
what |= BLK_TC_ACT(BLK_TC_ZONE_FINISH);
|
|
break;
|
|
case REQ_OP_ZONE_OPEN:
|
|
what |= BLK_TC_ACT(BLK_TC_ZONE_OPEN);
|
|
break;
|
|
case REQ_OP_ZONE_CLOSE:
|
|
what |= BLK_TC_ACT(BLK_TC_ZONE_CLOSE);
|
|
break;
|
|
case REQ_OP_WRITE_ZEROES:
|
|
what |= BLK_TC_ACT(BLK_TC_WRITE_ZEROES);
|
|
break;
|
|
default:
|
|
break;
|
|
}
|
|
|
|
/* Drop trace events for zone operations with blktrace v1 */
|
|
if (bt->version == 1 && (what >> BLK_TC_SHIFT) > BLK_TC_END_V1) {
|
|
pr_debug_ratelimited("blktrace v1 cannot trace zone operation 0x%llx\n",
|
|
(unsigned long long)what);
|
|
return;
|
|
}
|
|
|
|
if (cgid)
|
|
what |= __BLK_TA_CGROUP;
|
|
|
|
pid = tsk->pid;
|
|
if (act_log_check(bt, what, sector, pid))
|
|
return;
|
|
cpu = raw_smp_processor_id();
|
|
|
|
if (blk_tracer) {
|
|
tracing_record_cmdline(current);
|
|
|
|
buffer = blk_tr->array_buffer.buffer;
|
|
trace_ctx = tracing_gen_ctx_flags(0);
|
|
switch (bt->version) {
|
|
case 1:
|
|
trace_len = sizeof(struct blk_io_trace);
|
|
break;
|
|
case 2:
|
|
default:
|
|
/*
|
|
* ftrace always uses v2 (blk_io_trace2) format.
|
|
*
|
|
* For sysfs-enabled tracing path (enabled via
|
|
* /sys/block/DEV/trace/enable), blk_trace_setup_queue()
|
|
* never initializes bt->version, leaving it 0 from
|
|
* kzalloc(). We must handle version==0 safely here.
|
|
*
|
|
* Fall through to default to ensure we never hit the
|
|
* old bug where default set trace_len=0, causing
|
|
* buffer underflow and memory corruption.
|
|
*
|
|
* Always use v2 format for ftrace and normalize
|
|
* bt->version to 2 when uninitialized.
|
|
*/
|
|
trace_len = sizeof(struct blk_io_trace2);
|
|
if (bt->version == 0)
|
|
bt->version = 2;
|
|
break;
|
|
}
|
|
trace_len += pdu_len + cgid_len;
|
|
event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
|
|
trace_len, trace_ctx);
|
|
if (!event)
|
|
return;
|
|
|
|
switch (bt->version) {
|
|
case 1:
|
|
record_blktrace_event(ring_buffer_event_data(event),
|
|
pid, cpu, sector, bytes,
|
|
what, bt->dev, error, cgid, cgid_len,
|
|
pdu_data, pdu_len);
|
|
break;
|
|
case 2:
|
|
default:
|
|
/*
|
|
* Use v2 recording function (record_blktrace_event2)
|
|
* which writes blk_io_trace2 structure with correct
|
|
* field layout:
|
|
* - 32-bit pid at offset 28
|
|
* - 64-bit action at offset 32
|
|
*
|
|
* Fall through to default handles version==0 case
|
|
* (from sysfs path), ensuring we always use correct
|
|
* v2 recording function to match the v2 buffer
|
|
* allocated above.
|
|
*/
|
|
record_blktrace_event2(ring_buffer_event_data(event),
|
|
pid, cpu, sector, bytes,
|
|
what, bt->dev, error, cgid, cgid_len,
|
|
pdu_data, pdu_len);
|
|
break;
|
|
}
|
|
|
|
trace_buffer_unlock_commit(blk_tr, buffer, event, trace_ctx);
|
|
return;
|
|
}
|
|
|
|
if (unlikely(tsk->btrace_seq != blktrace_seq))
|
|
trace_note_tsk(tsk);
|
|
|
|
/*
|
|
* A word about the locking here - we disable interrupts to reserve
|
|
* some space in the relay per-cpu buffer, to prevent an irq
|
|
* from coming in and stepping on our toes.
|
|
*/
|
|
local_irq_save(flags);
|
|
sequence = per_cpu_ptr(bt->sequence, cpu);
|
|
(*sequence)++;
|
|
relay_blktrace_event(bt, *sequence, pid, cpu, sector, bytes,
|
|
what, error, cgid, cgid_len, pdu_data, pdu_len);
|
|
local_irq_restore(flags);
|
|
}
|
|
|
|
static void blk_trace_free(struct request_queue *q, struct blk_trace *bt)
|
|
{
|
|
relay_close(bt->rchan);
|
|
|
|
/*
|
|
* If 'bt->dir' is not set, then both 'dropped' and 'msg' are created
|
|
* under 'q->debugfs_dir', thus lookup and remove them.
|
|
*/
|
|
if (!bt->dir) {
|
|
debugfs_lookup_and_remove("dropped", q->debugfs_dir);
|
|
debugfs_lookup_and_remove("msg", q->debugfs_dir);
|
|
} else {
|
|
debugfs_remove(bt->dir);
|
|
}
|
|
free_percpu(bt->sequence);
|
|
free_percpu(bt->msg_data);
|
|
kfree(bt);
|
|
}
|
|
|
|
static void get_probe_ref(void)
{
	mutex_lock(&blk_probe_mutex);
	if (++blk_probes_ref == 1)
		blk_register_tracepoints();
	mutex_unlock(&blk_probe_mutex);
}

static void put_probe_ref(void)
{
	mutex_lock(&blk_probe_mutex);
	if (!--blk_probes_ref)
		blk_unregister_tracepoints();
	mutex_unlock(&blk_probe_mutex);
}
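
get_probe_ref() and put_probe_ref() register the tracepoints when the first user appears and unregister them when the last one goes away. The sketch below shows that first-get/last-put pattern on its own. It is single-threaded and deliberately omits the mutex that the real code holds around the counter; all `demo_*` names are hypothetical.

```c
#include <assert.h>

int demo_refs;			/* number of active users */
int demo_registered;		/* 1 while the pretend probes are registered */
int demo_register_calls;	/* how many times registration ran */

/* Take a reference; the first user triggers registration. */
void demo_get(void)
{
	if (++demo_refs == 1) {
		demo_registered = 1;
		demo_register_calls++;
	}
}

/* Drop a reference; the last user triggers unregistration. */
void demo_put(void)
{
	assert(demo_refs > 0);
	if (!--demo_refs)
		demo_registered = 0;
}
```

Two calls to `demo_get()` register only once, and the probes stay registered until the second `demo_put()`. In concurrent code the increment and the registration must happen under one lock, as they do in the listing, so that a second user cannot observe a non-zero count before registration has finished.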
|
|
|
|
static int blk_trace_start(struct blk_trace *bt)
|
|
{
|
|
if (bt->trace_state != Blktrace_setup &&
|
|
bt->trace_state != Blktrace_stopped)
|
|
return -EINVAL;
|
|
|
|
blktrace_seq++;
|
|
smp_mb();
|
|
bt->trace_state = Blktrace_running;
|
|
raw_spin_lock_irq(&running_trace_lock);
|
|
list_add(&bt->running_list, &running_trace_list);
|
|
raw_spin_unlock_irq(&running_trace_lock);
|
|
trace_note_time(bt);
|
|
|
|
return 0;
|
|
}
|
|
|
|
static int blk_trace_stop(struct blk_trace *bt)
|
|
{
|
|
if (bt->trace_state != Blktrace_running)
|
|
return -EINVAL;
|
|
|
|
bt->trace_state = Blktrace_stopped;
|
|
raw_spin_lock_irq(&running_trace_lock);
|
|
list_del_init(&bt->running_list);
|
|
raw_spin_unlock_irq(&running_trace_lock);
|
|
relay_flush(bt->rchan);
|
|
|
|
return 0;
|
|
}
|
|
|
|
static void blk_trace_cleanup(struct request_queue *q, struct blk_trace *bt)
|
|
{
|
|
blk_trace_stop(bt);
|
|
synchronize_rcu();
|
|
blk_trace_free(q, bt);
|
|
put_probe_ref();
|
|
}
|
|
|
|
static int __blk_trace_remove(struct request_queue *q)
|
|
{
|
|
struct blk_trace *bt;
|
|
|
|
bt = rcu_replace_pointer(q->blk_trace, NULL,
|
|
lockdep_is_held(&q->debugfs_mutex));
|
|
if (!bt)
|
|
return -EINVAL;
|
|
|
|
blk_trace_cleanup(q, bt);
|
|
|
|
return 0;
|
|
}
|
|
|
|
int blk_trace_remove(struct request_queue *q)
|
|
{
|
|
int ret;
|
|
|
|
mutex_lock(&q->debugfs_mutex);
|
|
ret = __blk_trace_remove(q);
|
|
mutex_unlock(&q->debugfs_mutex);
|
|
|
|
return ret;
|
|
}
|
|
EXPORT_SYMBOL_GPL(blk_trace_remove);
|
|
|
|
static ssize_t blk_dropped_read(struct file *filp, char __user *buffer,
|
|
size_t count, loff_t *ppos)
|
|
{
|
|
struct blk_trace *bt = filp->private_data;
|
|
size_t dropped = relay_stats(bt->rchan, RELAY_STATS_BUF_FULL);
|
|
char buf[16];
|
|
|
|
snprintf(buf, sizeof(buf), "%zu\n", dropped);
|
|
|
|
return simple_read_from_buffer(buffer, count, ppos, buf, strlen(buf));
|
|
}
|
|
|
|
static const struct file_operations blk_dropped_fops = {
|
|
.owner = THIS_MODULE,
|
|
.open = simple_open,
|
|
.read = blk_dropped_read,
|
|
.llseek = default_llseek,
|
|
};
|
|
|
|
static ssize_t blk_msg_write(struct file *filp, const char __user *buffer,
|
|
size_t count, loff_t *ppos)
|
|
{
|
|
char *msg;
|
|
struct blk_trace *bt;
|
|
|
|
if (count >= BLK_TN_MAX_MSG)
|
|
return -EINVAL;
|
|
|
|
msg = memdup_user_nul(buffer, count);
|
|
if (IS_ERR(msg))
|
|
return PTR_ERR(msg);
|
|
|
|
bt = filp->private_data;
|
|
__blk_trace_note_message(bt, NULL, "%s", msg);
|
|
kfree(msg);
|
|
|
|
return count;
|
|
}
|
|
|
|
static const struct file_operations blk_msg_fops = {
|
|
.owner = THIS_MODULE,
|
|
.open = simple_open,
|
|
.write = blk_msg_write,
|
|
.llseek = noop_llseek,
|
|
};
|
|
|
|
static int blk_remove_buf_file_callback(struct dentry *dentry)
|
|
{
|
|
debugfs_remove(dentry);
|
|
|
|
return 0;
|
|
}
|
|
|
|
static struct dentry *blk_create_buf_file_callback(const char *filename,
|
|
struct dentry *parent,
|
|
umode_t mode,
|
|
struct rchan_buf *buf,
|
|
int *is_global)
|
|
{
|
|
return debugfs_create_file(filename, mode, parent, buf,
|
|
&relay_file_operations);
|
|
}
|
|
|
|
static const struct rchan_callbacks blk_relay_callbacks = {
|
|
.create_buf_file = blk_create_buf_file_callback,
|
|
.remove_buf_file = blk_remove_buf_file_callback,
|
|
};
|
|
|
|
static void blk_trace_setup_lba(struct blk_trace *bt,
|
|
struct block_device *bdev)
|
|
{
|
|
if (bdev) {
|
|
bt->start_lba = bdev->bd_start_sect;
|
|
bt->end_lba = bdev->bd_start_sect + bdev_nr_sectors(bdev);
|
|
} else {
|
|
bt->start_lba = 0;
|
|
bt->end_lba = -1ULL;
|
|
}
|
|
}
|
|
|
|
/*
|
|
* Setup everything required to start tracing
|
|
*/
|
|
static struct blk_trace *blk_trace_setup_prepare(struct request_queue *q,
|
|
char *name, dev_t dev,
|
|
u32 buf_size, u32 buf_nr,
|
|
struct block_device *bdev)
|
|
{
|
|
struct blk_trace *bt = NULL;
|
|
struct dentry *dir = NULL;
|
|
int ret;
|
|
|
|
lockdep_assert_held(&q->debugfs_mutex);
|
|
|
|
	/*
	 * bdev can be NULL, as with scsi-generic, this is as helpful as
	 * we can be.
	 */
|
|
if (rcu_dereference_protected(q->blk_trace,
|
|
lockdep_is_held(&q->debugfs_mutex))) {
|
|
pr_warn("Concurrent blktraces are not allowed on %s\n", name);
|
|
return ERR_PTR(-EBUSY);
|
|
}
|
|
|
|
bt = kzalloc(sizeof(*bt), GFP_KERNEL);
|
|
if (!bt)
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
ret = -ENOMEM;
|
|
bt->sequence = alloc_percpu(unsigned long);
|
|
if (!bt->sequence)
|
|
goto err;
|
|
|
|
bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG, __alignof__(char));
|
|
if (!bt->msg_data)
|
|
goto err;
|
|
|
|
	/*
	 * When tracing the whole disk reuse the existing debugfs directory
	 * created by the block layer on init. For partition block devices
	 * and scsi-generic block devices we create a temporary new debugfs
	 * directory that will be removed once the trace ends.
	 */
|
|
if (bdev && !bdev_is_partition(bdev))
|
|
dir = q->debugfs_dir;
|
|
else
|
|
bt->dir = dir = debugfs_create_dir(name, blk_debugfs_root);
|
|
|
|
/*
|
|
* As blktrace relies on debugfs for its interface the debugfs directory
|
|
* is required, contrary to the usual mantra of not checking for debugfs
|
|
* files or directories.
|
|
*/
|
|
if (IS_ERR_OR_NULL(dir)) {
|
|
pr_warn("debugfs_dir not present for %s so skipping\n", name);
|
|
ret = -ENOENT;
|
|
goto err;
|
|
}
|
|
|
|
bt->dev = dev;
|
|
INIT_LIST_HEAD(&bt->running_list);
|
|
|
|
ret = -EIO;
|
|
debugfs_create_file("dropped", 0444, dir, bt, &blk_dropped_fops);
|
|
debugfs_create_file("msg", 0222, dir, bt, &blk_msg_fops);
|
|
|
|
bt->rchan = relay_open("trace", dir, buf_size, buf_nr,
|
|
&blk_relay_callbacks, bt);
|
|
if (!bt->rchan)
|
|
goto err;
|
|
|
|
blk_trace_setup_lba(bt, bdev);
|
|
|
|
return bt;
|
|
|
|
err:
|
|
blk_trace_free(q, bt);
|
|
|
|
return ERR_PTR(ret);
|
|
}
|
|
|
|
static void blk_trace_setup_finalize(struct request_queue *q,
|
|
char *name, int version,
|
|
struct blk_trace *bt,
|
|
struct blk_user_trace_setup2 *buts)
|
|
|
|
{
|
|
strscpy_pad(buts->name, name, BLKTRACE_BDEV_SIZE2);
|
|
|
|
/*
|
|
* some device names have larger paths - convert the slashes
|
|
* to underscores for this to work as expected
|
|
*/
|
|
strreplace(buts->name, '/', '_');
|
|
|
|
bt->version = version;
|
|
bt->act_mask = buts->act_mask;
|
|
if (!bt->act_mask)
|
|
bt->act_mask = (u16) -1;
|
|
|
|
/* overwrite with user settings */
|
|
if (buts->start_lba)
|
|
bt->start_lba = buts->start_lba;
|
|
if (buts->end_lba)
|
|
bt->end_lba = buts->end_lba;
|
|
|
|
bt->pid = buts->pid;
|
|
bt->trace_state = Blktrace_setup;
|
|
|
|
rcu_assign_pointer(q->blk_trace, bt);
|
|
get_probe_ref();
|
|
}
|
|
|
|
int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
|
|
struct block_device *bdev,
|
|
char __user *arg)
|
|
{
|
|
struct blk_user_trace_setup2 buts2;
|
|
struct blk_user_trace_setup buts;
|
|
struct blk_trace *bt;
|
|
int ret;
|
|
|
|
ret = copy_from_user(&buts, arg, sizeof(buts));
|
|
if (ret)
|
|
return -EFAULT;
|
|
|
|
if (!buts.buf_size || !buts.buf_nr)
|
|
return -EINVAL;
|
|
|
|
buts2 = (struct blk_user_trace_setup2) {
|
|
.act_mask = buts.act_mask,
|
|
.buf_size = buts.buf_size,
|
|
.buf_nr = buts.buf_nr,
|
|
.start_lba = buts.start_lba,
|
|
.end_lba = buts.end_lba,
|
|
.pid = buts.pid,
|
|
};
|
|
|
|
mutex_lock(&q->debugfs_mutex);
|
|
bt = blk_trace_setup_prepare(q, name, dev, buts.buf_size, buts.buf_nr,
|
|
bdev);
|
|
if (IS_ERR(bt)) {
|
|
mutex_unlock(&q->debugfs_mutex);
|
|
return PTR_ERR(bt);
|
|
}
|
|
blk_trace_setup_finalize(q, name, 1, bt, &buts2);
|
|
strscpy(buts.name, buts2.name, BLKTRACE_BDEV_SIZE);
|
|
mutex_unlock(&q->debugfs_mutex);
|
|
|
|
if (copy_to_user(arg, &buts, sizeof(buts))) {
|
|
blk_trace_remove(q);
|
|
return -EFAULT;
|
|
}
|
|
return 0;
|
|
}
|
|
EXPORT_SYMBOL_GPL(blk_trace_setup);
|
|
|
|
static int blk_trace_setup2(struct request_queue *q, char *name, dev_t dev,
			    struct block_device *bdev, char __user *arg)
{
	struct blk_user_trace_setup2 buts2;
	struct blk_trace *bt;

	if (copy_from_user(&buts2, arg, sizeof(buts2)))
		return -EFAULT;

	if (!buts2.buf_size || !buts2.buf_nr)
		return -EINVAL;

	if (buts2.flags != 0)
		return -EINVAL;

	mutex_lock(&q->debugfs_mutex);
	bt = blk_trace_setup_prepare(q, name, dev, buts2.buf_size, buts2.buf_nr,
				     bdev);
	if (IS_ERR(bt)) {
		mutex_unlock(&q->debugfs_mutex);
		return PTR_ERR(bt);
	}
	blk_trace_setup_finalize(q, name, 2, bt, &buts2);
	mutex_unlock(&q->debugfs_mutex);

	if (copy_to_user(arg, &buts2, sizeof(buts2))) {
		blk_trace_remove(q);
		return -EFAULT;
	}
	return 0;
}

#if defined(CONFIG_COMPAT) && defined(CONFIG_X86_64)
static int compat_blk_trace_setup(struct request_queue *q, char *name,
				  dev_t dev, struct block_device *bdev,
				  char __user *arg)
{
	struct blk_user_trace_setup2 buts2;
	struct compat_blk_user_trace_setup cbuts;
	struct blk_trace *bt;

	if (copy_from_user(&cbuts, arg, sizeof(cbuts)))
		return -EFAULT;

	if (!cbuts.buf_size || !cbuts.buf_nr)
		return -EINVAL;

	buts2 = (struct blk_user_trace_setup2) {
		.act_mask = cbuts.act_mask,
		.buf_size = cbuts.buf_size,
		.buf_nr = cbuts.buf_nr,
		.start_lba = cbuts.start_lba,
		.end_lba = cbuts.end_lba,
		.pid = cbuts.pid,
	};

	mutex_lock(&q->debugfs_mutex);
	bt = blk_trace_setup_prepare(q, name, dev, buts2.buf_size, buts2.buf_nr,
				     bdev);
	if (IS_ERR(bt)) {
		mutex_unlock(&q->debugfs_mutex);
		return PTR_ERR(bt);
	}
	blk_trace_setup_finalize(q, name, 1, bt, &buts2);
	mutex_unlock(&q->debugfs_mutex);

	if (copy_to_user(arg, &buts2.name, ARRAY_SIZE(buts2.name))) {
		blk_trace_remove(q);
		return -EFAULT;
	}

	return 0;
}
#endif

static int __blk_trace_startstop(struct request_queue *q, int start)
{
	struct blk_trace *bt;

	bt = rcu_dereference_protected(q->blk_trace,
				       lockdep_is_held(&q->debugfs_mutex));
	if (bt == NULL)
		return -EINVAL;

	if (start)
		return blk_trace_start(bt);
	else
		return blk_trace_stop(bt);
}

int blk_trace_startstop(struct request_queue *q, int start)
{
	int ret;

	mutex_lock(&q->debugfs_mutex);
	ret = __blk_trace_startstop(q, start);
	mutex_unlock(&q->debugfs_mutex);

	return ret;
}
EXPORT_SYMBOL_GPL(blk_trace_startstop);

/*
 * When reading or writing the blktrace sysfs files, the references to the
 * opened sysfs or device files should prevent the underlying block device
 * from being removed. So no further delete protection is really needed.
 */

/**
 * blk_trace_ioctl - handle the ioctls associated with tracing
 * @bdev: the block device
 * @cmd: the ioctl cmd
 * @arg: the argument data, if any
 *
 **/
int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
{
	struct request_queue *q = bdev_get_queue(bdev);
	int ret, start = 0;
	char b[BDEVNAME_SIZE];

	switch (cmd) {
	case BLKTRACESETUP2:
		snprintf(b, sizeof(b), "%pg", bdev);
		ret = blk_trace_setup2(q, b, bdev->bd_dev, bdev, arg);
		break;
	case BLKTRACESETUP:
		snprintf(b, sizeof(b), "%pg", bdev);
		ret = blk_trace_setup(q, b, bdev->bd_dev, bdev, arg);
		break;
#if defined(CONFIG_COMPAT) && defined(CONFIG_X86_64)
	case BLKTRACESETUP32:
		snprintf(b, sizeof(b), "%pg", bdev);
		ret = compat_blk_trace_setup(q, b, bdev->bd_dev, bdev, arg);
		break;
#endif
	case BLKTRACESTART:
		start = 1;
		fallthrough;
	case BLKTRACESTOP:
		ret = blk_trace_startstop(q, start);
		break;
	case BLKTRACETEARDOWN:
		ret = blk_trace_remove(q);
		break;
	default:
		ret = -ENOTTY;
		break;
	}
	return ret;
}

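The START and STOP cases in blk_trace_ioctl() share one call site: START sets a flag and falls through into STOP, and any unrecognized command yields -ENOTTY. The userspace sketch below reproduces only that control flow. The command enum, `struct sketch_trace` and both functions are hypothetical; the real start and stop helpers do more state checking than is modeled here.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical command numbers, not the real BLKTRACE* ioctl values. */
enum sketch_cmd { SK_CMD_START = 1, SK_CMD_STOP, SK_CMD_TEARDOWN };

/* Hypothetical stand-in for the per-queue trace state. */
struct sketch_trace {
	int present;	/* non-zero once a trace has been set up */
	int running;	/* non-zero while tracing is started */
};

/* One helper serves both start and stop, selected by 'start'. */
int sketch_startstop(struct sketch_trace *t, int start)
{
	if (!t->present)
		return -EINVAL;
	t->running = start;
	return 0;
}

int sketch_ioctl(struct sketch_trace *t, unsigned cmd)
{
	int start = 0;

	switch (cmd) {
	case SK_CMD_START:
		start = 1;
		/* fall through */
	case SK_CMD_STOP:
		return sketch_startstop(t, start);
	case SK_CMD_TEARDOWN:
		if (!t->present)
			return -EINVAL;
		t->present = 0;
		t->running = 0;
		return 0;
	default:
		/* Unknown command: "inappropriate ioctl for device". */
		return -ENOTTY;
	}
}
```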
/**
 * blk_trace_shutdown - stop and cleanup trace structures
 * @q: the request queue associated with the device
 *
 **/
void blk_trace_shutdown(struct request_queue *q)
{
	if (rcu_dereference_protected(q->blk_trace,
				      lockdep_is_held(&q->debugfs_mutex)))
		__blk_trace_remove(q);
}

#ifdef CONFIG_BLK_CGROUP
static u64 blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio)
{
	struct cgroup_subsys_state *blkcg_css;
	struct blk_trace *bt;

	/* We don't use the 'bt' value here except as an optimization... */
	bt = rcu_dereference_protected(q->blk_trace, 1);
	if (!bt || !(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
		return 0;

	blkcg_css = bio_blkcg_css(bio);
	if (!blkcg_css)
		return 0;
	return cgroup_id(blkcg_css->cgroup);
}
#else
static u64 blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio)
{
	return 0;
}
#endif

static u64
blk_trace_request_get_cgid(struct request *rq)
{
	if (!rq->bio)
		return 0;
	/* Use the first bio */
	return blk_trace_bio_get_cgid(rq->q, rq->bio);
}

/*
 * blktrace probes
 */

/**
 * blk_add_trace_rq - Add a trace for a request oriented action
 * @rq: the source request
 * @error: return status to log
 * @nr_bytes: number of completed bytes
 * @what: the action
 * @cgid: the cgroup info
 *
 * Description:
 *     Records an action against a request. Will log the bio offset + size.
 *
 **/
static void blk_add_trace_rq(struct request *rq, blk_status_t error,
			     unsigned int nr_bytes, u64 what, u64 cgid)
{
	struct blk_trace *bt;

	rcu_read_lock();
	bt = rcu_dereference(rq->q->blk_trace);
	if (likely(!bt)) {
		rcu_read_unlock();
		return;
	}

	if (blk_rq_is_passthrough(rq))
		what |= BLK_TC_ACT(BLK_TC_PC);
	else
		what |= BLK_TC_ACT(BLK_TC_FS);

	__blk_add_trace(bt, blk_rq_trace_sector(rq), nr_bytes, rq->cmd_flags,
			what, blk_status_to_errno(error), 0, NULL, cgid);
	rcu_read_unlock();
}

static void blk_add_trace_rq_insert(void *ignore, struct request *rq)
{
	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT,
			 blk_trace_request_get_cgid(rq));
}

static void blk_add_trace_rq_issue(void *ignore, struct request *rq)
{
	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ISSUE,
			 blk_trace_request_get_cgid(rq));
}

static void blk_add_trace_rq_merge(void *ignore, struct request *rq)
{
	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_BACKMERGE,
			 blk_trace_request_get_cgid(rq));
}

static void blk_add_trace_rq_requeue(void *ignore, struct request *rq)
{
	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_REQUEUE,
			 blk_trace_request_get_cgid(rq));
}

static void blk_add_trace_rq_complete(void *ignore, struct request *rq,
			blk_status_t error, unsigned int nr_bytes)
{
	blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE,
			 blk_trace_request_get_cgid(rq));
}

static void blk_add_trace_zone_update_request(void *ignore, struct request *rq)
{
	struct blk_trace *bt;

	rcu_read_lock();
	bt = rcu_dereference(rq->q->blk_trace);
	if (likely(!bt) || bt->version < 2) {
		rcu_read_unlock();
		return;
	}
	rcu_read_unlock();

	blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ZONE_APPEND,
			 blk_trace_request_get_cgid(rq));
}

/**
 * blk_add_trace_bio - Add a trace for a bio oriented action
 * @q: queue the io is for
 * @bio: the source bio
 * @what: the action
 * @error: error, if any
 *
 * Description:
 *     Records an action against a bio. Will log the bio offset + size.
 *
 **/
static void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
			      u64 what, int error)
{
	struct blk_trace *bt;

	rcu_read_lock();
	bt = rcu_dereference(q->blk_trace);
	if (likely(!bt)) {
		rcu_read_unlock();
		return;
	}

	__blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
			bio->bi_opf, what, error, 0, NULL,
			blk_trace_bio_get_cgid(q, bio));
	rcu_read_unlock();
}

static void blk_add_trace_bio_complete(void *ignore,
				       struct request_queue *q, struct bio *bio)
{
	blk_add_trace_bio(q, bio, BLK_TA_COMPLETE,
			  blk_status_to_errno(bio->bi_status));
}

static void blk_add_trace_bio_backmerge(void *ignore, struct bio *bio)
{
	blk_add_trace_bio(bio->bi_bdev->bd_disk->queue, bio, BLK_TA_BACKMERGE,
			  0);
}

static void blk_add_trace_bio_frontmerge(void *ignore, struct bio *bio)
{
	blk_add_trace_bio(bio->bi_bdev->bd_disk->queue, bio, BLK_TA_FRONTMERGE,
			  0);
}

static void blk_add_trace_bio_queue(void *ignore, struct bio *bio)
{
	blk_add_trace_bio(bio->bi_bdev->bd_disk->queue, bio, BLK_TA_QUEUE, 0);
}

static void blk_add_trace_getrq(void *ignore, struct bio *bio)
{
	blk_add_trace_bio(bio->bi_bdev->bd_disk->queue, bio, BLK_TA_GETRQ, 0);
}

static void blk_add_trace_plug(void *ignore, struct request_queue *q)
{
	struct blk_trace *bt;

	rcu_read_lock();
	bt = rcu_dereference(q->blk_trace);
	if (bt)
		__blk_add_trace(bt, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL, 0);
	rcu_read_unlock();
}

static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
				 unsigned int depth, bool explicit)
{
	struct blk_trace *bt;

	rcu_read_lock();
	bt = rcu_dereference(q->blk_trace);
	if (bt) {
		__be64 rpdu = cpu_to_be64(depth);
		u64 what;

		if (explicit)
			what = BLK_TA_UNPLUG_IO;
		else
			what = BLK_TA_UNPLUG_TIMER;

		__blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu, 0);
	}
	rcu_read_unlock();
}

static void blk_add_trace_zone_plug(void *ignore, struct request_queue *q,
				    unsigned int zno, sector_t sector,
				    unsigned int sectors)
{
	struct blk_trace *bt;

	rcu_read_lock();
	bt = rcu_dereference(q->blk_trace);
	if (bt && bt->version >= 2)
		__blk_add_trace(bt, sector, sectors << SECTOR_SHIFT, 0,
				BLK_TA_ZONE_PLUG, 0, 0, NULL, 0);
	rcu_read_unlock();

	return;
}

static void blk_add_trace_zone_unplug(void *ignore, struct request_queue *q,
				      unsigned int zno, sector_t sector,
				      unsigned int sectors)
{
	struct blk_trace *bt;

	rcu_read_lock();
	bt = rcu_dereference(q->blk_trace);
	if (bt && bt->version >= 2)
		__blk_add_trace(bt, sector, sectors << SECTOR_SHIFT, 0,
				BLK_TA_ZONE_UNPLUG, 0, 0, NULL, 0);
	rcu_read_unlock();
	return;
}

static void blk_add_trace_split(void *ignore, struct bio *bio, unsigned int pdu)
{
	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
	struct blk_trace *bt;

	rcu_read_lock();
	bt = rcu_dereference(q->blk_trace);
	if (bt) {
		__be64 rpdu = cpu_to_be64(pdu);

		__blk_add_trace(bt, bio->bi_iter.bi_sector,
				bio->bi_iter.bi_size, bio->bi_opf, BLK_TA_SPLIT,
				blk_status_to_errno(bio->bi_status),
				sizeof(rpdu), &rpdu,
				blk_trace_bio_get_cgid(q, bio));
	}
	rcu_read_unlock();
}

/**
 * blk_add_trace_bio_remap - Add a trace for a bio-remap operation
 * @ignore: trace callback data parameter (not used)
 * @bio: the source bio
 * @dev: source device
 * @from: source sector
 *
 * Called after a bio is remapped to a different device and/or sector.
 **/
static void blk_add_trace_bio_remap(void *ignore, struct bio *bio, dev_t dev,
				    sector_t from)
{
	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
	struct blk_trace *bt;
	struct blk_io_trace_remap r;

	rcu_read_lock();
	bt = rcu_dereference(q->blk_trace);
	if (likely(!bt)) {
		rcu_read_unlock();
		return;
	}

	r.device_from = cpu_to_be32(dev);
	r.device_to = cpu_to_be32(bio_dev(bio));
	r.sector_from = cpu_to_be64(from);

	__blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
			bio->bi_opf, BLK_TA_REMAP,
			blk_status_to_errno(bio->bi_status),
			sizeof(r), &r, blk_trace_bio_get_cgid(q, bio));
	rcu_read_unlock();
}

/**
 * blk_add_trace_rq_remap - Add a trace for a request-remap operation
 * @ignore: trace callback data parameter (not used)
 * @rq: the source request
 * @dev: target device
 * @from: source sector
 *
 * Description:
 *     Device mapper remaps request to other devices.
 *     Add a trace for that action.
 *
 **/
static void blk_add_trace_rq_remap(void *ignore, struct request *rq, dev_t dev,
				   sector_t from)
{
	struct blk_trace *bt;
	struct blk_io_trace_remap r;

	rcu_read_lock();
	bt = rcu_dereference(rq->q->blk_trace);
	if (likely(!bt)) {
		rcu_read_unlock();
		return;
	}

	r.device_from = cpu_to_be32(dev);
	r.device_to = cpu_to_be32(disk_devt(rq->q->disk));
	r.sector_from = cpu_to_be64(from);

	__blk_add_trace(bt, blk_rq_pos(rq), blk_rq_bytes(rq),
			rq->cmd_flags, BLK_TA_REMAP, 0,
			sizeof(r), &r, blk_trace_request_get_cgid(rq));
	rcu_read_unlock();
}

/**
 * blk_add_driver_data - Add binary message with driver-specific data
 * @rq: io request
 * @data: driver-specific data
 * @len: length of driver-specific data
 *
 * Description:
 *     Some drivers might want to write driver-specific data per request.
 *
 **/
void blk_add_driver_data(struct request *rq, void *data, size_t len)
{
	struct blk_trace *bt;

	rcu_read_lock();
	bt = rcu_dereference(rq->q->blk_trace);
	if (likely(!bt)) {
		rcu_read_unlock();
		return;
	}

	__blk_add_trace(bt, blk_rq_trace_sector(rq), blk_rq_bytes(rq), 0,
			BLK_TA_DRV_DATA, 0, len, data,
			blk_trace_request_get_cgid(rq));
	rcu_read_unlock();
}
EXPORT_SYMBOL_GPL(blk_add_driver_data);

static void blk_register_tracepoints(void)
{
	int ret;

	ret = register_trace_block_rq_insert(blk_add_trace_rq_insert, NULL);
	WARN_ON(ret);
	ret = register_trace_block_rq_issue(blk_add_trace_rq_issue, NULL);
	WARN_ON(ret);
	ret = register_trace_block_rq_merge(blk_add_trace_rq_merge, NULL);
	WARN_ON(ret);
	ret = register_trace_block_rq_requeue(blk_add_trace_rq_requeue, NULL);
	WARN_ON(ret);
	ret = register_trace_block_rq_complete(blk_add_trace_rq_complete, NULL);
	WARN_ON(ret);
	ret = register_trace_block_bio_complete(blk_add_trace_bio_complete, NULL);
	WARN_ON(ret);
	ret = register_trace_block_bio_backmerge(blk_add_trace_bio_backmerge, NULL);
	WARN_ON(ret);
	ret = register_trace_block_bio_frontmerge(blk_add_trace_bio_frontmerge, NULL);
	WARN_ON(ret);
	ret = register_trace_block_bio_queue(blk_add_trace_bio_queue, NULL);
	WARN_ON(ret);
	ret = register_trace_block_getrq(blk_add_trace_getrq, NULL);
	WARN_ON(ret);
	ret = register_trace_blk_zone_append_update_request_bio(
		blk_add_trace_zone_update_request, NULL);
	WARN_ON(ret);
	ret = register_trace_disk_zone_wplug_add_bio(blk_add_trace_zone_plug,
						     NULL);
	WARN_ON(ret);
	ret = register_trace_blk_zone_wplug_bio(blk_add_trace_zone_unplug,
						NULL);
	WARN_ON(ret);
	ret = register_trace_block_plug(blk_add_trace_plug, NULL);
	WARN_ON(ret);
	ret = register_trace_block_unplug(blk_add_trace_unplug, NULL);
	WARN_ON(ret);
	ret = register_trace_block_split(blk_add_trace_split, NULL);
	WARN_ON(ret);
	ret = register_trace_block_bio_remap(blk_add_trace_bio_remap, NULL);
	WARN_ON(ret);
	ret = register_trace_block_rq_remap(blk_add_trace_rq_remap, NULL);
	WARN_ON(ret);
}

static void blk_unregister_tracepoints(void)
{
	unregister_trace_block_rq_remap(blk_add_trace_rq_remap, NULL);
	unregister_trace_block_bio_remap(blk_add_trace_bio_remap, NULL);
	unregister_trace_block_split(blk_add_trace_split, NULL);
	unregister_trace_block_unplug(blk_add_trace_unplug, NULL);
	unregister_trace_block_plug(blk_add_trace_plug, NULL);
	unregister_trace_blk_zone_wplug_bio(blk_add_trace_zone_unplug, NULL);
	unregister_trace_disk_zone_wplug_add_bio(blk_add_trace_zone_plug, NULL);
	unregister_trace_blk_zone_append_update_request_bio(
		blk_add_trace_zone_update_request, NULL);
	unregister_trace_block_getrq(blk_add_trace_getrq, NULL);
	unregister_trace_block_bio_queue(blk_add_trace_bio_queue, NULL);
	unregister_trace_block_bio_frontmerge(blk_add_trace_bio_frontmerge, NULL);
	unregister_trace_block_bio_backmerge(blk_add_trace_bio_backmerge, NULL);
	unregister_trace_block_bio_complete(blk_add_trace_bio_complete, NULL);
	unregister_trace_block_rq_complete(blk_add_trace_rq_complete, NULL);
	unregister_trace_block_rq_requeue(blk_add_trace_rq_requeue, NULL);
	unregister_trace_block_rq_merge(blk_add_trace_rq_merge, NULL);
	unregister_trace_block_rq_issue(blk_add_trace_rq_issue, NULL);
	unregister_trace_block_rq_insert(blk_add_trace_rq_insert, NULL);

	tracepoint_synchronize_unregister();
}

/*
 * struct blk_io_tracer formatting routines
 */

static void fill_rwbs(char *rwbs, const struct blk_io_trace2 *t)
{
	int i = 0;
	int tc = t->action >> BLK_TC_SHIFT;

	if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) {
		rwbs[i++] = 'N';
		goto out;
	}

	if (tc & BLK_TC_FLUSH)
		rwbs[i++] = 'F';

	if (tc & BLK_TC_DISCARD)
		rwbs[i++] = 'D';
	else if (tc & BLK_TC_WRITE_ZEROES) {
		rwbs[i++] = 'W';
		rwbs[i++] = 'Z';
	} else if (tc & BLK_TC_WRITE)
		rwbs[i++] = 'W';
	else if (t->bytes)
		rwbs[i++] = 'R';
	else
		rwbs[i++] = 'N';

	if (tc & BLK_TC_FUA)
		rwbs[i++] = 'F';
	if (tc & BLK_TC_AHEAD)
		rwbs[i++] = 'A';
	if (tc & BLK_TC_SYNC)
		rwbs[i++] = 'S';
	if (tc & BLK_TC_META)
		rwbs[i++] = 'M';
out:
	rwbs[i] = '\0';
}

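fill_rwbs() turns category bits into the short flag string shown in trace output: an optional leading 'F' for flush, one operation letter (or "WZ"), then optional modifiers. The following userspace sketch mirrors that ordering so it can be tried outside the kernel. The `SK_*` bit values and `sketch_fill_rwbs` are hypothetical and do not match the real BLK_TC_* constants; the message special case is left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical category bits; the real BLK_TC_* values differ. */
enum {
	SK_WRITE	= 1 << 0,
	SK_FLUSH	= 1 << 1,
	SK_DISCARD	= 1 << 2,
	SK_WRITE_ZEROES	= 1 << 3,
	SK_FUA		= 1 << 4,
	SK_AHEAD	= 1 << 5,
	SK_SYNC		= 1 << 6,
	SK_META		= 1 << 7,
};

/* Longest output is "FWZFASM" (7 characters) plus the terminating NUL. */
#define SK_RWBS_LEN 9

void sketch_fill_rwbs(char *rwbs, unsigned tc, unsigned bytes)
{
	int i = 0;

	if (tc & SK_FLUSH)
		rwbs[i++] = 'F';

	/* Exactly one operation marker is chosen, in priority order. */
	if (tc & SK_DISCARD) {
		rwbs[i++] = 'D';
	} else if (tc & SK_WRITE_ZEROES) {
		rwbs[i++] = 'W';
		rwbs[i++] = 'Z';
	} else if (tc & SK_WRITE) {
		rwbs[i++] = 'W';
	} else if (bytes) {
		rwbs[i++] = 'R';	/* non-write with a payload: read */
	} else {
		rwbs[i++] = 'N';	/* no data transferred */
	}

	if (tc & SK_FUA)
		rwbs[i++] = 'F';
	if (tc & SK_AHEAD)
		rwbs[i++] = 'A';
	if (tc & SK_SYNC)
		rwbs[i++] = 'S';
	if (tc & SK_META)
		rwbs[i++] = 'M';
	rwbs[i] = '\0';
}
```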
static inline
const struct blk_io_trace2 *te_blk_io_trace(const struct trace_entry *ent)
{
	return (const struct blk_io_trace2 *)ent;
}

static inline const void *pdu_start(const struct trace_entry *ent, bool has_cg)
{
	return (void *)(te_blk_io_trace(ent) + 1) + (has_cg ? sizeof(u64) : 0);
}

static inline u64 t_cgid(const struct trace_entry *ent)
{
	return *(u64 *)(te_blk_io_trace(ent) + 1);
}

static inline int pdu_real_len(const struct trace_entry *ent, bool has_cg)
{
	return te_blk_io_trace(ent)->pdu_len - (has_cg ? sizeof(u64) : 0);
}

static inline u32 t_action(const struct trace_entry *ent)
{
	return te_blk_io_trace(ent)->action;
}

static inline u32 t_bytes(const struct trace_entry *ent)
{
	return te_blk_io_trace(ent)->bytes;
}

static inline u32 t_sec(const struct trace_entry *ent)
{
	return te_blk_io_trace(ent)->bytes >> 9;
}

static inline unsigned long long t_sector(const struct trace_entry *ent)
{
	return te_blk_io_trace(ent)->sector;
}

static inline __u16 t_error(const struct trace_entry *ent)
{
	return te_blk_io_trace(ent)->error;
}

static __u64 get_pdu_int(const struct trace_entry *ent, bool has_cg)
{
	const __be64 *val = pdu_start(ent, has_cg);
	return be64_to_cpu(*val);
}

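The unplug and split probes store their integer payload with cpu_to_be64(), and get_pdu_int() reads it back with be64_to_cpu(), so the payload has the same byte order on every host. A portable userspace sketch of that encode and decode pair, written with explicit shifts rather than the kernel helpers, is shown here. `sketch_put_be64` and `sketch_get_be64` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Store v most-significant byte first, independent of host endianness. */
void sketch_put_be64(unsigned char out[8], uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Rebuild the host-order value from eight big-endian bytes. */
uint64_t sketch_get_be64(const unsigned char in[8])
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}
```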
typedef void (blk_log_action_t) (struct trace_iterator *iter, const char *act,
	bool has_cg);

static void blk_log_action_classic(struct trace_iterator *iter, const char *act,
	bool has_cg)
{
	char rwbs[RWBS_LEN];
	unsigned long long ts = iter->ts;
	unsigned long nsec_rem = do_div(ts, NSEC_PER_SEC);
	unsigned secs = (unsigned long)ts;
	const struct blk_io_trace2 *t = te_blk_io_trace(iter->ent);

	fill_rwbs(rwbs, t);

	trace_seq_printf(&iter->seq,
			 "%3d,%-3d %2d %5d.%09lu %5u %2s %3s ",
			 MAJOR(t->device), MINOR(t->device), iter->cpu,
			 secs, nsec_rem, iter->ent->pid, act, rwbs);
}

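In blk_log_action_classic(), do_div() divides the nanosecond timestamp in place and returns the remainder, which gives the "seconds.nanoseconds" pair printed in the classic format. A small userspace sketch of the same split using plain 64-bit division follows; `sketch_format_ts` is a hypothetical helper, and it uses `%5u` for the seconds field.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SK_NSEC_PER_SEC 1000000000ULL

/* Format a nanosecond timestamp as "secs.nanoseconds" (9 fractional digits). */
void sketch_format_ts(char *out, size_t cap, uint64_t ts_ns)
{
	unsigned long nsec_rem = (unsigned long)(ts_ns % SK_NSEC_PER_SEC);
	unsigned secs = (unsigned)(ts_ns / SK_NSEC_PER_SEC);

	snprintf(out, cap, "%5u.%09lu", secs, nsec_rem);
}
```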
static void blk_log_action(struct trace_iterator *iter, const char *act,
	bool has_cg)
{
	char rwbs[RWBS_LEN];
	const struct blk_io_trace2 *t = te_blk_io_trace(iter->ent);

	fill_rwbs(rwbs, t);
	if (has_cg) {
		u64 id = t_cgid(iter->ent);

		if (blk_tracer_flags.val & TRACE_BLK_OPT_CGNAME) {
			char blkcg_name_buf[NAME_MAX + 1] = "<...>";

			cgroup_path_from_kernfs_id(id, blkcg_name_buf,
				sizeof(blkcg_name_buf));
			trace_seq_printf(&iter->seq, "%3d,%-3d %s %2s %3s ",
				 MAJOR(t->device), MINOR(t->device),
				 blkcg_name_buf, act, rwbs);
		} else {
			/*
			 * The cgid portion used to be "INO,GEN". Userland
			 * builds a FILEID_INO32_GEN fid out of them and
			 * opens the cgroup using open_by_handle_at(2).
			 * While 32bit ino setups are still the same, 64bit
			 * ones now use the 64bit ino as the whole ID and
			 * no longer use generation.
			 *
			 * Regardless of the content, always output
			 * "LOW32,HIGH32" so that FILEID_INO32_GEN fid can
			 * be mapped back to @id on both 64 and 32bit ino
			 * setups. See __kernfs_fh_to_dentry().
			 */
			trace_seq_printf(&iter->seq,
				 "%3d,%-3d %llx,%-llx %2s %3s ",
				 MAJOR(t->device), MINOR(t->device),
				 id & U32_MAX, id >> 32, act, rwbs);
		}
	} else
		trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ",
				 MAJOR(t->device), MINOR(t->device), act, rwbs);
}

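The comment in blk_log_action() describes printing the 64-bit cgroup id as "LOW32,HIGH32" in hex so that a reader can map the pair back to the original id. A minimal userspace sketch of that split and of the reverse mapping is given below; `sketch_format_cgid` and `sketch_join_cgid` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Print the id as "LOW32,HIGH32", both halves in hex. */
void sketch_format_cgid(char *out, size_t cap, uint64_t id)
{
	snprintf(out, cap, "%llx,%llx",
		 (unsigned long long)(id & 0xffffffffULL),
		 (unsigned long long)(id >> 32));
}

/* Reverse mapping: rebuild the 64-bit id from the two halves. */
uint64_t sketch_join_cgid(uint32_t low, uint32_t high)
{
	return ((uint64_t)high << 32) | low;
}
```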
static void blk_log_dump_pdu(struct trace_seq *s,
	const struct trace_entry *ent, bool has_cg)
{
	const unsigned char *pdu_buf;
	int pdu_len;
	int i, end;

	pdu_buf = pdu_start(ent, has_cg);
	pdu_len = pdu_real_len(ent, has_cg);

	if (!pdu_len)
		return;

	/* find the last zero that needs to be printed */
	for (end = pdu_len - 1; end >= 0; end--)
		if (pdu_buf[end])
			break;
	end++;

	trace_seq_putc(s, '(');

	for (i = 0; i < pdu_len; i++) {

		trace_seq_printf(s, "%s%02x",
				 i == 0 ? "" : " ", pdu_buf[i]);

		/*
		 * stop when the rest is just zeros and indicate so
		 * with a ".." appended
		 */
		if (i == end && end != pdu_len - 1) {
			trace_seq_puts(s, " ..) ");
			return;
		}
	}

	trace_seq_puts(s, ") ");
}

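blk_log_dump_pdu() prints the payload as hex bytes and cuts the output short with ".." once only zero bytes remain, printing one zero byte first. The userspace sketch below follows the same loop but writes into a caller-supplied buffer instead of a trace_seq. `sketch_dump_pdu` and its buffer-size rule (at least 3 * len + 8 bytes) are assumptions of this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hex-dump 'len' bytes into 'out'. Returns the number of characters
 * written, or -1 if the buffer is smaller than 3 * len + 8 bytes.
 */
int sketch_dump_pdu(char *out, size_t cap, const unsigned char *pdu, int len)
{
	char *p = out;
	int i, end;

	if (len < 0 || cap < (size_t)len * 3 + 8)
		return -1;
	*p = '\0';
	if (len == 0)
		return 0;

	/* 'end' becomes the index just past the last non-zero byte. */
	for (end = len - 1; end >= 0; end--)
		if (pdu[end])
			break;
	end++;

	*p++ = '(';
	for (i = 0; i < len; i++) {
		p += sprintf(p, "%s%02x", i == 0 ? "" : " ", pdu[i]);

		/* Only zeros remain: print ".." and stop early. */
		if (i == end && end != len - 1) {
			p += sprintf(p, " ..) ");
			return (int)(p - out);
		}
	}
	p += sprintf(p, ") ");
	return (int)(p - out);
}
```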
static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
	char cmd[TASK_COMM_LEN];

	trace_find_cmdline(ent->pid, cmd);

	if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
		trace_seq_printf(s, "%u ", t_bytes(ent));
		blk_log_dump_pdu(s, ent, has_cg);
		trace_seq_printf(s, "[%s]\n", cmd);
	} else {
		if (t_sec(ent))
			trace_seq_printf(s, "%llu + %u [%s]\n",
						t_sector(ent), t_sec(ent), cmd);
		else
			trace_seq_printf(s, "[%s]\n", cmd);
	}
}

static void blk_log_with_error(struct trace_seq *s,
			      const struct trace_entry *ent, bool has_cg)
{
	if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
		blk_log_dump_pdu(s, ent, has_cg);
		trace_seq_printf(s, "[%d]\n", t_error(ent));
	} else {
		if (t_sec(ent))
			trace_seq_printf(s, "%llu + %u [%d]\n",
					 t_sector(ent),
					 t_sec(ent), t_error(ent));
		else
			trace_seq_printf(s, "%llu [%d]\n",
					 t_sector(ent), t_error(ent));
	}
}

static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
	const struct blk_io_trace_remap *__r = pdu_start(ent, has_cg);

	trace_seq_printf(s, "%llu + %u <- (%d,%d) %llu\n",
			 t_sector(ent), t_sec(ent),
			 MAJOR(be32_to_cpu(__r->device_from)),
			 MINOR(be32_to_cpu(__r->device_from)),
			 be64_to_cpu(__r->sector_from));
}

static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
	char cmd[TASK_COMM_LEN];

	trace_find_cmdline(ent->pid, cmd);

	trace_seq_printf(s, "[%s]\n", cmd);
}

static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
	char cmd[TASK_COMM_LEN];

	trace_find_cmdline(ent->pid, cmd);

	trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent, has_cg));
}

static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
	char cmd[TASK_COMM_LEN];

	trace_find_cmdline(ent->pid, cmd);

	trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent),
			 get_pdu_int(ent, has_cg), cmd);
}

static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent,
			bool has_cg)
{

	trace_seq_putmem(s, pdu_start(ent, has_cg),
		pdu_real_len(ent, has_cg));
	trace_seq_putc(s, '\n');
}

/*
 * struct tracer operations
 */

static void blk_tracer_print_header(struct seq_file *m)
{
	if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC))
		return;
	seq_puts(m, "# DEV CPU TIMESTAMP PID ACT FLG\n"
		    "# | | | | | |\n");
}

static void blk_tracer_start(struct trace_array *tr)
{
	blk_tracer_enabled = true;
}

static int blk_tracer_init(struct trace_array *tr)
{
	blk_tr = tr;
	blk_tracer_start(tr);
	return 0;
}

static void blk_tracer_stop(struct trace_array *tr)
{
	blk_tracer_enabled = false;
}

static void blk_tracer_reset(struct trace_array *tr)
{
	blk_tracer_stop(tr);
}

static const struct {
	const char *act[2];
	void	   (*print)(struct trace_seq *s, const struct trace_entry *ent,
			    bool has_cg);
} what2act[] = {
	[__BLK_TA_QUEUE]	= {{ "Q", "queue" }, blk_log_generic },
	[__BLK_TA_BACKMERGE]	= {{ "M", "backmerge" }, blk_log_generic },
	[__BLK_TA_FRONTMERGE]	= {{ "F", "frontmerge" }, blk_log_generic },
	[__BLK_TA_GETRQ]	= {{ "G", "getrq" }, blk_log_generic },
	[__BLK_TA_SLEEPRQ]	= {{ "S", "sleeprq" }, blk_log_generic },
	[__BLK_TA_REQUEUE]	= {{ "R", "requeue" }, blk_log_with_error },
	[__BLK_TA_ISSUE]	= {{ "D", "issue" }, blk_log_generic },
	[__BLK_TA_COMPLETE]	= {{ "C", "complete" }, blk_log_with_error },
	[__BLK_TA_PLUG]		= {{ "P", "plug" }, blk_log_plug },
	[__BLK_TA_UNPLUG_IO]	= {{ "U", "unplug_io" }, blk_log_unplug },
	[__BLK_TA_UNPLUG_TIMER]	= {{ "UT", "unplug_timer" }, blk_log_unplug },
	[__BLK_TA_INSERT]	= {{ "I", "insert" }, blk_log_generic },
	[__BLK_TA_SPLIT]	= {{ "X", "split" }, blk_log_split },
	[__BLK_TA_REMAP]	= {{ "A", "remap" }, blk_log_remap },
};

static enum print_line_t print_one_line(struct trace_iterator *iter,
					bool classic)
{
	struct trace_array *tr = iter->tr;
	struct trace_seq *s = &iter->seq;
	const struct blk_io_trace2 *t;
	u16 what;
	bool long_act;
	blk_log_action_t *log_action;
	bool has_cg;

	t = te_blk_io_trace(iter->ent);
	what = (t->action & ((1 << BLK_TC_SHIFT) - 1)) & ~__BLK_TA_CGROUP;
	long_act = !!(tr->trace_flags & TRACE_ITER(VERBOSE));
	log_action = classic ? &blk_log_action_classic : &blk_log_action;
	has_cg = t->action & __BLK_TA_CGROUP;

	if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) {
		log_action(iter, long_act ? "message" : "m", has_cg);
		blk_log_msg(s, iter->ent, has_cg);
		return trace_handle_return(s);
	}

	if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act)))
		trace_seq_printf(s, "Unknown action %x\n", what);
	else {
		log_action(iter, what2act[what].act[long_act], has_cg);
		what2act[what].print(s, iter->ent, has_cg);
	}

	return trace_handle_return(s);
}

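print_one_line() masks the action word down to an index, checks it against the table bounds, and then picks the short or long action name from what2act[]. The userspace sketch below shows that lookup with a designated-initializer table. All `SK_*` values and `sketch_act_name` are hypothetical, and the extra check for an empty table slot is an addition of this sketch rather than something the code above does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical action numbers; index 3 is deliberately left unused. */
enum { SK_TA_QUEUE = 1, SK_TA_ISSUE = 2, SK_TA_COMPLETE = 4 };

#define SK_ACT_BITS	16		/* low bits carry the action index */
#define SK_CGROUP_FLAG	(1u << 8)	/* hypothetical "has cgroup id" bit */

static const struct {
	const char *act[2];		/* [0] short name, [1] long name */
} sk_what2act[] = {
	[SK_TA_QUEUE]	 = {{ "Q", "queue" }},
	[SK_TA_ISSUE]	 = {{ "D", "issue" }},
	[SK_TA_COMPLETE] = {{ "C", "complete" }},
};

#define SK_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Return the action name, or NULL for an unknown action. */
const char *sketch_act_name(uint32_t action, int long_act)
{
	uint32_t what = (action & ((1u << SK_ACT_BITS) - 1)) & ~SK_CGROUP_FLAG;

	if (what == 0 || what >= SK_ARRAY_SIZE(sk_what2act))
		return NULL;
	/* Slots skipped by the designated initializers are zero-filled. */
	if (!sk_what2act[what].act[0])
		return NULL;
	return sk_what2act[what].act[long_act ? 1 : 0];
}
```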
static enum print_line_t blk_trace_event_print(struct trace_iterator *iter,
					       int flags, struct trace_event *event)
{
	return print_one_line(iter, false);
}

static void blk_trace_synthesize_old_trace(struct trace_iterator *iter)
{
	struct trace_seq *s = &iter->seq;
	struct blk_io_trace2 *t = (struct blk_io_trace2 *)iter->ent;
	const int offset = offsetof(struct blk_io_trace2, sector);
	struct blk_io_trace old = {
		.magic	  = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION,
		.time     = iter->ts,
	};

	trace_seq_putmem(s, &old, offset);
	trace_seq_putmem(s, &t->sector,
			 sizeof(old) - offset + t->pdu_len);
}

static enum print_line_t
blk_trace_event_print_binary(struct trace_iterator *iter, int flags,
			     struct trace_event *event)
{
	blk_trace_synthesize_old_trace(iter);

	return trace_handle_return(&iter->seq);
}

static enum print_line_t blk_tracer_print_line(struct trace_iterator *iter)
{
	if ((iter->ent->type != TRACE_BLK) ||
	    !(blk_tracer_flags.val & TRACE_BLK_OPT_CLASSIC))
		return TRACE_TYPE_UNHANDLED;

	return print_one_line(iter, true);
}

static int
blk_tracer_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
{
	/* don't output context-info for blk_classic output */
	if (bit == TRACE_BLK_OPT_CLASSIC) {
		if (set)
			tr->trace_flags &= ~TRACE_ITER(CONTEXT_INFO);
		else
			tr->trace_flags |= TRACE_ITER(CONTEXT_INFO);
	}
	return 0;
}

static struct tracer blk_tracer __read_mostly = {
	.name		= "blk",
	.init		= blk_tracer_init,
	.reset		= blk_tracer_reset,
	.start		= blk_tracer_start,
	.stop		= blk_tracer_stop,
	.print_header	= blk_tracer_print_header,
	.print_line	= blk_tracer_print_line,
	.flags		= &blk_tracer_flags,
	.set_flag	= blk_tracer_set_flag,
};

static struct trace_event_functions trace_blk_event_funcs = {
	.trace		= blk_trace_event_print,
	.binary		= blk_trace_event_print_binary,
};

static struct trace_event trace_blk_event = {
	.type		= TRACE_BLK,
	.funcs		= &trace_blk_event_funcs,
};

static struct work_struct blktrace_works __initdata;

static int __init __init_blk_tracer(void)
{
	if (!register_trace_event(&trace_blk_event)) {
		pr_warn("Warning: could not register block events\n");
		return 1;
	}

	if (register_tracer(&blk_tracer) != 0) {
		pr_warn("Warning: could not register the block tracer\n");
		unregister_trace_event(&trace_blk_event);
		return 1;
	}

	BUILD_BUG_ON(__alignof__(struct blk_user_trace_setup2) %
		     __alignof__(long));
	BUILD_BUG_ON(__alignof__(struct blk_io_trace2) % __alignof__(long));

	return 0;
}

static void __init blktrace_works_func(struct work_struct *work)
{
	__init_blk_tracer();
}

static int __init init_blk_tracer(void)
{
	int ret = 0;

	if (trace_init_wq) {
		INIT_WORK(&blktrace_works, blktrace_works_func);
		queue_work(trace_init_wq, &blktrace_works);
	} else {
		ret = __init_blk_tracer();
	}

	return ret;
}

device_initcall(init_blk_tracer);

static int blk_trace_remove_queue(struct request_queue *q)
{
	struct blk_trace *bt;

	bt = rcu_replace_pointer(q->blk_trace, NULL,
				 lockdep_is_held(&q->debugfs_mutex));
	if (bt == NULL)
		return -EINVAL;

	blk_trace_stop(bt);

	put_probe_ref();
	synchronize_rcu();
	blk_trace_free(q, bt);
	return 0;
}

/*
 * Setup everything required to start tracing
 */
static int blk_trace_setup_queue(struct request_queue *q,
				 struct block_device *bdev)
{
	struct blk_trace *bt = NULL;
	int ret = -ENOMEM;

	bt = kzalloc(sizeof(*bt), GFP_KERNEL);
	if (!bt)
		return -ENOMEM;

	bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG, __alignof__(char));
	if (!bt->msg_data)
		goto free_bt;

	bt->dev = bdev->bd_dev;
	bt->act_mask = (u16)-1;

	blk_trace_setup_lba(bt, bdev);

	rcu_assign_pointer(q->blk_trace, bt);
	get_probe_ref();
	return 0;

free_bt:
	blk_trace_free(q, bt);
	return ret;
}

/*
 * sysfs interface to enable and configure tracing
 */

static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
					 struct device_attribute *attr,
					 char *buf);
static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
					  struct device_attribute *attr,
					  const char *buf, size_t count);
#define BLK_TRACE_DEVICE_ATTR(_name) \
	DEVICE_ATTR(_name, S_IRUGO | S_IWUSR, \
		    sysfs_blk_trace_attr_show, \
		    sysfs_blk_trace_attr_store)

static BLK_TRACE_DEVICE_ATTR(enable);
static BLK_TRACE_DEVICE_ATTR(act_mask);
static BLK_TRACE_DEVICE_ATTR(pid);
static BLK_TRACE_DEVICE_ATTR(start_lba);
static BLK_TRACE_DEVICE_ATTR(end_lba);

static struct attribute *blk_trace_attrs[] = {
	&dev_attr_enable.attr,
	&dev_attr_act_mask.attr,
	&dev_attr_pid.attr,
	&dev_attr_start_lba.attr,
	&dev_attr_end_lba.attr,
	NULL
};

struct attribute_group blk_trace_attr_group = {
	.name  = "trace",
	.attrs = blk_trace_attrs,
};

static const struct {
	int mask;
	const char *str;
} mask_maps[] = {
	{ BLK_TC_READ,		"read"		},
	{ BLK_TC_WRITE,		"write"		},
	{ BLK_TC_FLUSH,		"flush"		},
	{ BLK_TC_SYNC,		"sync"		},
	{ BLK_TC_QUEUE,		"queue"		},
	{ BLK_TC_REQUEUE,	"requeue"	},
	{ BLK_TC_ISSUE,		"issue"		},
	{ BLK_TC_COMPLETE,	"complete"	},
	{ BLK_TC_FS,		"fs"		},
	{ BLK_TC_PC,		"pc"		},
	{ BLK_TC_NOTIFY,	"notify"	},
	{ BLK_TC_AHEAD,		"ahead"		},
	{ BLK_TC_META,		"meta"		},
	{ BLK_TC_DISCARD,	"discard"	},
	{ BLK_TC_DRV_DATA,	"drv_data"	},
	{ BLK_TC_FUA,		"fua"		},
	{ BLK_TC_WRITE_ZEROES,	"write-zeroes"	},
};

/*
 * Parse a comma-separated list of trace category names (matched
 * case-insensitively against mask_maps[]) into a bitmask. Empty tokens
 * are skipped. Returns the mask, -EINVAL on an unknown name, or -ENOMEM
 * if the working copy of the string cannot be allocated.
 */
static int blk_trace_str2mask(const char *str)
{
	int i;
	int mask = 0;
	char *buf, *s, *token;

	buf = kstrdup(str, GFP_KERNEL);
	if (buf == NULL)
		return -ENOMEM;
	s = strstrip(buf);

	while (1) {
		token = strsep(&s, ",");
		if (token == NULL)
			break;

		if (*token == '\0')
			continue;

		for (i = 0; i < ARRAY_SIZE(mask_maps); i++) {
			if (strcasecmp(token, mask_maps[i].str) == 0) {
				mask |= mask_maps[i].mask;
				break;
			}
		}
		if (i == ARRAY_SIZE(mask_maps)) {
			mask = -EINVAL;
			break;
		}
	}
	kfree(buf);

	return mask;
}

/*
 * Write the names of the categories set in @mask to @buf as a
 * comma-separated list followed by a newline. Returns the number of
 * bytes written.
 */
static ssize_t blk_trace_mask2str(char *buf, int mask)
{
	int i;
	char *p = buf;

	for (i = 0; i < ARRAY_SIZE(mask_maps); i++) {
		if (mask & mask_maps[i].mask) {
			p += sprintf(p, "%s%s",
				    (p == buf) ? "" : ",", mask_maps[i].str);
		}
	}
	*p++ = '\n';

	return p - buf;
}

static ssize_t sysfs_blk_trace_attr_show(struct device *dev,
					 struct device_attribute *attr,
					 char *buf)
{
	struct block_device *bdev = dev_to_bdev(dev);
	struct request_queue *q = bdev_get_queue(bdev);
	struct blk_trace *bt;
	ssize_t ret = -ENXIO;

	mutex_lock(&q->debugfs_mutex);

	bt = rcu_dereference_protected(q->blk_trace,
				       lockdep_is_held(&q->debugfs_mutex));
	if (attr == &dev_attr_enable) {
		ret = sprintf(buf, "%u\n", !!bt);
		goto out_unlock_bdev;
	}

	if (bt == NULL)
		ret = sprintf(buf, "disabled\n");
	else if (attr == &dev_attr_act_mask)
		ret = blk_trace_mask2str(buf, bt->act_mask);
	else if (attr == &dev_attr_pid)
		ret = sprintf(buf, "%u\n", bt->pid);
	else if (attr == &dev_attr_start_lba)
		ret = sprintf(buf, "%llu\n", bt->start_lba);
	else if (attr == &dev_attr_end_lba)
		ret = sprintf(buf, "%llu\n", bt->end_lba);

out_unlock_bdev:
	mutex_unlock(&q->debugfs_mutex);
	return ret;
}

static ssize_t sysfs_blk_trace_attr_store(struct device *dev,
					  struct device_attribute *attr,
					  const char *buf, size_t count)
{
	struct block_device *bdev = dev_to_bdev(dev);
	struct request_queue *q = bdev_get_queue(bdev);
	struct blk_trace *bt;
	u64 value;
	ssize_t ret = -EINVAL;

	if (count == 0)
		goto out;

	if (attr == &dev_attr_act_mask) {
		if (kstrtoull(buf, 0, &value)) {
			/* Assume it is a list of trace category names */
			ret = blk_trace_str2mask(buf);
			if (ret < 0)
				goto out;
			value = ret;
		}
	} else {
		if (kstrtoull(buf, 0, &value))
			goto out;
	}

	mutex_lock(&q->debugfs_mutex);

	bt = rcu_dereference_protected(q->blk_trace,
				       lockdep_is_held(&q->debugfs_mutex));
	if (attr == &dev_attr_enable) {
		if (!!value == !!bt) {
			ret = 0;
			goto out_unlock_bdev;
		}
		if (value)
			ret = blk_trace_setup_queue(q, bdev);
		else
			ret = blk_trace_remove_queue(q);
		goto out_unlock_bdev;
	}

	ret = 0;
	if (bt == NULL) {
		ret = blk_trace_setup_queue(q, bdev);
		bt = rcu_dereference_protected(q->blk_trace,
				lockdep_is_held(&q->debugfs_mutex));
	}

	if (ret == 0) {
		if (attr == &dev_attr_act_mask)
			bt->act_mask = value;
		else if (attr == &dev_attr_pid)
			bt->pid = value;
		else if (attr == &dev_attr_start_lba)
			bt->start_lba = value;
		else if (attr == &dev_attr_end_lba)
			bt->end_lba = value;
	}

out_unlock_bdev:
	mutex_unlock(&q->debugfs_mutex);
out:
	return ret ? ret : count;
}
#endif /* CONFIG_BLK_DEV_IO_TRACE */

#ifdef CONFIG_EVENT_TRACING

/**
 * blk_fill_rwbs - Fill the buffer rwbs by mapping op to character string.
 * @rwbs:	buffer to be filled
 * @opf:	request operation type (REQ_OP_XXX) and flags for the tracepoint
 *
 * Description:
 *     Maps each request operation and flag to a single character and fills the
 *     buffer provided by the caller with the resulting string.
 *
 **/
void blk_fill_rwbs(char *rwbs, blk_opf_t opf)
{
	int i = 0;

	if (opf & REQ_PREFLUSH)
		rwbs[i++] = 'F';

	switch (opf & REQ_OP_MASK) {
	case REQ_OP_WRITE:
		rwbs[i++] = 'W';
		break;
	case REQ_OP_DISCARD:
		rwbs[i++] = 'D';
		break;
	case REQ_OP_SECURE_ERASE:
		rwbs[i++] = 'D';
		rwbs[i++] = 'E';
		break;
	case REQ_OP_FLUSH:
		rwbs[i++] = 'F';
		break;
	case REQ_OP_READ:
		rwbs[i++] = 'R';
		break;
	case REQ_OP_ZONE_APPEND:
		rwbs[i++] = 'Z';
		rwbs[i++] = 'A';
		break;
	case REQ_OP_ZONE_RESET:
	case REQ_OP_ZONE_RESET_ALL:
		rwbs[i++] = 'Z';
		rwbs[i++] = 'R';
		if ((opf & REQ_OP_MASK) == REQ_OP_ZONE_RESET_ALL)
			rwbs[i++] = 'A';
		break;
	case REQ_OP_ZONE_FINISH:
		rwbs[i++] = 'Z';
		rwbs[i++] = 'F';
		break;
	case REQ_OP_ZONE_OPEN:
		rwbs[i++] = 'Z';
		rwbs[i++] = 'O';
		break;
	case REQ_OP_ZONE_CLOSE:
		rwbs[i++] = 'Z';
		rwbs[i++] = 'C';
		break;
	case REQ_OP_WRITE_ZEROES:
		rwbs[i++] = 'W';
		rwbs[i++] = 'Z';
		break;
	default:
		rwbs[i++] = 'N';
	}

	if (opf & REQ_FUA)
		rwbs[i++] = 'F';
	if (opf & REQ_RAHEAD)
		rwbs[i++] = 'A';
	if (opf & REQ_SYNC)
		rwbs[i++] = 'S';
	if (opf & REQ_META)
		rwbs[i++] = 'M';
	if (opf & REQ_ATOMIC)
		rwbs[i++] = 'U';

	WARN_ON_ONCE(i >= RWBS_LEN);

	rwbs[i] = '\0';
}
EXPORT_SYMBOL_GPL(blk_fill_rwbs);

#endif /* CONFIG_EVENT_TRACING */